Showing posts with label ray. Show all posts

Thursday, January 18, 2024

[FIXED] Parallelize DeepFace on multiple GPUs

January 18, 2024 deepface, gpu, ray, tensorflow, tensorflow2.0 No comments

Issue

I am trying to use the DeepFace python library to do face recognition and analysis on long videos: https://github.com/serengil/deepface.

Using the library out of the box, I am able to get desired results by selecting frames from a video and then iterating through a for loop.

Single GPU

import decord
import tensorflow as tf
from deepface import DeepFace

video_path = 'myvideopath'
vr = decord.VideoReader(video_path)

for i in range(0, 100, FRAME_STEP):
    image_bgr = vr[i].asnumpy()[:,:,::-1]
    results = DeepFace.find(img_path = image_bgr, **other_parameters)

This works, but is too slow for the amount of video and frames that I need to go through.

When running the model, I notice that it uses ~600 MB for prediction, so I should be able to run multiple instances on the same physical GPU. I am only using DeepFace for prediction and am not training or fine tuning any models.

gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    try:
        tf.config.set_logical_device_configuration(gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=630)] * 12)
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
    
logical_gpus = tf.config.list_logical_devices('GPU')
print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")

2 Physical GPU, 24 Logical GPUs

I would like to be able to parallelize the DeepFace.find and DeepFace.analyze functions.

The first thing that I tried to do was to have a queue of free gpu devices and use concurrent.futures.ThreadPoolExecutor.

def multigpu_helper(index, device_name, image_bgr, fn, fn_dict, q):
    print(f'{index:5} {device_name}')
    start_timer = timeit.default_timer()
    with tf.device(device_name):
        results = fn(img_path=image_bgr, **fn_dict)
    q.put(device_name)
    end_timer = timeit.default_timer()
    print(f'MultiGPU Time: {end_timer-start_timer} sec.')
    return results


def multigpu_process(iterable, vr, fn, fn_dict):
    logical_devices = tf.config.list_logical_devices(device_type='GPU')
    print(logical_devices)

    q = queue.Queue()
    for logical_device in logical_devices:
        q.put(logical_device.name)

    results_dict = dict()

    item_list = list(iterable)

    with concurrent.futures.ThreadPoolExecutor(max_workers=len(logical_devices)) as pool:
        future_jobs = dict()

        while item_list:
            device_name = q.get()
            index = item_list.pop(0)
            image_bgr = vr[index].asnumpy()[:, :, ::-1]
            future_jobs[pool.submit(multigpu_helper, index, device_name, image_bgr, fn, fn_dict, q)] = index

        for future in concurrent.futures.as_completed(future_jobs):
            index = future_jobs.get(future)
            results = future.result()
            results_dict[index] = results

    return results_dict

I am able to get the code to execute and output results, but it is no faster than doing it in a single for loop on a single GPU.

[LogicalDevice(name='/device:GPU:0', device_type='GPU'), LogicalDevice(name='/device:GPU:1', device_type='GPU'), LogicalDevice(name='/device:GPU:2', device_type='GPU'), LogicalDevice(name='/device:GPU:3', device_type='GPU'), LogicalDevice(name='/device:GPU:4', device_type='GPU'), LogicalDevice(name='/device:GPU:5', device_type='GPU'), LogicalDevice(name='/device:GPU:6', device_type='GPU'), LogicalDevice(name='/device:GPU:7', device_type='GPU'), LogicalDevice(name='/device:GPU:8', device_type='GPU'), LogicalDevice(name='/device:GPU:9', device_type='GPU'), LogicalDevice(name='/device:GPU:10', device_type='GPU'), LogicalDevice(name='/device:GPU:11', device_type='GPU'), LogicalDevice(name='/device:GPU:12', device_type='GPU'), LogicalDevice(name='/device:GPU:13', device_type='GPU'), LogicalDevice(name='/device:GPU:14', device_type='GPU'), LogicalDevice(name='/device:GPU:15', device_type='GPU'), LogicalDevice(name='/device:GPU:16', device_type='GPU'), LogicalDevice(name='/device:GPU:17', device_type='GPU'), LogicalDevice(name='/device:GPU:18', device_type='GPU'), LogicalDevice(name='/device:GPU:19', device_type='GPU'), LogicalDevice(name='/device:GPU:20', device_type='GPU'), LogicalDevice(name='/device:GPU:21', device_type='GPU'), LogicalDevice(name='/device:GPU:22', device_type='GPU'), LogicalDevice(name='/device:GPU:23', device_type='GPU')]
    0 /device:GPU:0
   30 /device:GPU:1
   60 /device:GPU:2
   90 /device:GPU:3
  120 /device:GPU:4
  150 /device:GPU:5
  180 /device:GPU:6
  210 /device:GPU:7
  240 /device:GPU:8
  270 /device:GPU:9
  300 /device:GPU:10
  330 /device:GPU:11
  360 /device:GPU:12
  390 /device:GPU:13
  420 /device:GPU:14
  450 /device:GPU:15
  480 /device:GPU:16
  510 /device:GPU:17
  540 /device:GPU:18
  570 /device:GPU:19
  600 /device:GPU:20
  630 /device:GPU:21
  660 /device:GPU:22
  690 /device:GPU:23
MultiGPU Time: 16.968208671023604 sec.
  720 /device:GPU:2
MultiGPU Time: 17.829027735977434 sec.
  750 /device:GPU:1
MultiGPU Time: 17.852755011990666 sec.
  780 /device:GPU:8
MultiGPU Time: 19.71368485200219 sec.MultiGPU Time: 19.543589979992248 sec.

MultiGPU Time: 19.8676836140221 sec.
  810 /device:GPU:4
MultiGPU Time: 19.85990399698494 sec.
  840 /device:GPU:11
  870 /device:GPU:0
MultiGPU Time: 20.076353634009138 sec.
  900 /device:GPU:6
  930 /device:GPU:3
MultiGPU Time: 20.145404886978213 sec.
MultiGPU Time: 20.27192261395976 sec.
  960 /device:GPU:9
  990 /device:GPU:7
MultiGPU Time: 20.459441539016552 sec.
MultiGPU Time: 20.418532160052564 sec.
MultiGPU Time: 20.581610807043035 sec.
MultiGPU Time: 20.545571406022646 sec.
MultiGPU Time: 20.832303048984613 sec.
MultiGPU Time: 20.97456920897821 sec.
MultiGPU Time: 20.994418176996987 sec.
MultiGPU Time: 21.35945221298607 sec.
MultiGPU Time: 21.50979186099721 sec.
MultiGPU Time: 21.405662977020256 sec.
MultiGPU Time: 21.542257393943146 sec.
MultiGPU Time: 22.063301149988547 sec.
MultiGPU Time: 21.665760322008282 sec.
MultiGPU Time: 22.105394209967926 sec.
MultiGPU Time: 6.661869053030387 sec.
MultiGPU Time: 9.814038792042993 sec.
MultiGPU Time: 7.658941667003091 sec.
MultiGPU Time: 8.546573753003031 sec.
MultiGPU Time: 10.831304075953085 sec.
MultiGPU Time: 9.250181486015208 sec.
MultiGPU Time: 8.87483947101282 sec.
MultiGPU Time: 12.432360459002666 sec.
MultiGPU Time: 9.511910478991922 sec.
MultiGPU Time: 9.66243519296404 sec.
Face Recognition MultiGPU Total Time: 29.63435428502271 sec.

In fact, a single GPU iteration of the DeepFace.find function in a for loop should take about 0.5 sec. It seems that the multithreading is causing all the threads to finish around their cumulative time which is slower and undesired.

I tried a second time without using a queue, just splitting the input indices into separate lists and then processing separately.

def cycle_baskets(items: List[Any], maxbaskets: int) -> List[List[Any]]:
    baskets = [[] for _ in range(min(maxbaskets, len(items)))]
    for item, basket in zip(items, cycle(baskets)):
        basket.append(item)
    return baskets


def multigpu_helper_split(device_name, item_list, video_path, fn, fn_dict):
    print(device_name)
    start_timer = timeit.default_timer()

    results_dict = dict()
    
    vr = decord.VideoReader(str(video_path))

    with tf.device(device_name):
        for index in item_list:
            start_index_timer = timeit.default_timer()

            image_bgr = vr[index].asnumpy()[:, :, ::-1]
            results_dict[index] = fn(img_path=image_bgr, **fn_dict)

            end_index_timer = timeit.default_timer()
            print(f'Device {device_name} Index {index:5} {end_index_timer - start_index_timer} sec.')

    end_timer = timeit.default_timer()
    print(f'MultiGPU Time: {end_timer - start_timer} sec.')
    return results_dict


def multigpu_process_split(iterable, video_path, fn, fn_dict):
    logical_devices = [device.name for device in tf.config.list_logical_devices(device_type='GPU')]
    print(logical_devices)

    results_dict = dict()

    item_lists = cycle_baskets(list(iterable), len(logical_devices))

    with concurrent.futures.ThreadPoolExecutor(max_workers=len(logical_devices)) as pool:
        future_jobs = {pool.submit(multigpu_helper_split, logical_devices[i], item_lists[i], video_path, fn, fn_dict) for i in range(len(logical_devices))}

        for future in concurrent.futures.as_completed(future_jobs):
            results_dict.update(future.result())

    return results_dict

This is also considerably slower and also caused the kernal to crash.

Device /device:GPU:18 Index   540 305.03293917299015 sec.
MultiGPU Time: 311.7356750360341 sec.
Device /device:GPU:22 Index   660 305.6161605300149 sec.
MultiGPU Time: 312.3281374910148 sec.
Device /device:GPU:5 Index   150 309.5672924729879 sec.
Device /device:GPU:13 Index   390 311.9252848789911 sec.
MultiGPU Time: 318.34215058299014 sec.
Device /device:GPU:0 Index     0 312.96517166896956 sec.
Device /device:GPU:3 Index    90 312.41818467900157 sec.
Device /device:GPU:4 Index   120 312.507540087041 sec.
Device /device:GPU:10 Index   300 312.49839297297876 sec.
MultiGPU Time: 319.4717267890228 sec.
Device /device:GPU:23 Index   690 313.53694368101424 sec.
MultiGPU Time: 320.6566755659878 sec.

I realize that with tf.device(device_name): is over the entire DeepFace function. Looking at the DeepFace source code, it looks like there is quite a lot more than tensorflow and what I really would want to parallelize is model.predict().

DeepFace.py

def represent():
...
# represent
        if "keras" in str(type(model)):
            # new tf versions show progress bar and it is annoying
            embedding = model.predict(img, verbose=0)[0].tolist()
        else:
            # SFace and Dlib are not keras models and no verbose arguments
            embedding = model.predict(img)[0].tolist()

How would I be able to parallelize the DeepFace.find and DeepFace.analyze functions to run on 24 logical GPUs that I have? I would like to be able to get a x24 speedup for processing the selected frames.

It would be much preferred if I could wrap something around the DeepFace functions themselves, but if that is not possible, then I could try to parallelize the source code of the DeepFace library.

Solution

I was able to parallelize DeepFace by parallelizing some of the internal functions using ray.

Answered By - jameszp

[FIXED] Is there a way to prevent ray.init() from hanging when using Python on Apple silicon (the M1 Max)?

November 28, 2022 apple-m1, mini-forge, python, ray, tensorflow No comments

Issue

So I am trying to run ray[rllib] in a Jupyter notebook (in a Miniforge virtual environment) on Apple silicon (the M1 Max). Although I can import ray normally into the notebook, the very next step (of running ray.init()) causes the notebook to hang. No error is returned--ray.init() never completes. Is there a fix for this?

This is my first time using Ray. I don't think the notebook or the commands I am entering is the issue because the notebook came pre-made from an instructor, and I have managed to get an identical notebook to run normally in a Miniforge environment on Windows 10.

I followed advice from developers at Ray M1 Mac (Apple Silicon) Support to install Miniforge for the M1 and create a virtual environment. I also leveraged this thread What is the proper way to install TensorFlow on Apple M1 in 2022 to devise a strategy for installing applications I need for a reinforcement learning application. Here are the contents of an environment.yml file I used to set up the Miniforge virtual environment:

name: tf-metal
channels:
  - apple
  - conda-forge
dependencies:
  - python=3.9
  - gym-all=0.21.0
  - pip
  - tensorflow-deps

  ## uncommented for use with Jupyter
  - ipykernel

  ## PyPI packages
  - pip:
    - jupyterlab
    - ray[rllib]==1.11
    - tensorflow-macos
    - tensorflow-metal

The steps I used in Terminal for creating the virtual environment were these:

# Download Miniforge3-MacOSX-arm64.sh and make it executable:
chmod u+x ./Miniforge3-MacOSX-arm64.sh

# run Miniforge
./Miniforge3-MacOSX-arm64.sh
# (or update it) ./Miniforge3-MacOSX-arm64.sh -u

# accept terms and conditions...
# run 'conda init' by entering 'yes'
# configure conda (then close and reopen Terminal):
conda config --set auto_activate_base false
# confirm '~/.bash_profile' reflects miniforge settings
# good-to-go...

# set up virtual environment
conda create --name rl_course2  # (choose any name you want)
# confirm acceptability of location (enter 'yes')
# activate env:
conda activate rl_course2
# configure channels (settings recommended by an instructor)
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
# install dependencies using environment.yml file shown above:
conda env update --name rl_course2 --file '/Users/.../environment.yml'
# check output for errors...(none found via text search)

So I created the virtual environment and installed all the dependencies with no errors, as far as I could tell:

Successfully installed MarkupSafe-2.1.1 PyWavelets-1.4.1 Send2Trash-1.8.0 absl-py-1.3.0 anyio-3.6.2 argon2-cffi-21.3.0 argon2-cffi-bindings-21.2.0 astunparse-1.6.3 async-timeout-4.0.2 attrs-22.1.0 babel-2.11.0 beautifulsoup4-4.11.1 bleach-5.0.1 cachetools-5.2.0 certifi-2022.9.24 cffi-1.15.1 charset-normalizer-2.1.1 click-8.1.3 contourpy-1.0.6 cycler-0.11.0 defusedxml-0.7.1 dm-tree-0.1.7 fastjsonschema-2.16.2 filelock-3.8.0 flatbuffers-22.10.26 fonttools-4.38.0 gast-0.4.0 google-auth-2.14.1 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.43.0 idna-3.4 imageio-2.22.4 importlib-metadata-5.0.0 ipython-genutils-0.2.0 jinja2-3.1.2 json5-0.9.10 jsonschema-4.17.1 jupyter-server-1.23.3 jupyterlab-3.5.0 jupyterlab-pygments-0.2.2 jupyterlab-server-2.16.3 keras-2.10.0 keras-preprocessing-1.1.2 kiwisolver-1.4.4 libclang-14.0.6 markdown-3.4.1 matplotlib-3.6.2 mistune-2.0.4 msgpack-1.0.4 nbclassic-0.4.8 nbclient-0.7.0 nbconvert-7.2.5 nbformat-5.7.0 networkx-2.8.8 notebook-6.5.2 notebook-shim-0.2.2 oauthlib-3.2.2 opt-einsum-3.3.0 pandas-1.5.1 pandocfilters-1.5.0 pillow-9.3.0 prometheus-client-0.15.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.21 pyrsistent-0.19.2 pytz-2022.6 pyyaml-6.0 ray-1.11.0 redis-4.3.5 requests-2.28.1 requests-oauthlib-1.3.1 rsa-4.9 scikit-image-0.19.3 sniffio-1.3.0 soupsieve-2.3.2.post1 tabulate-0.9.0 tensorboard-2.10.1 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorboardX-2.5.1 tensorflow-estimator-2.10.0 tensorflow-macos-2.10.0 tensorflow-metal-0.6.0 termcolor-2.1.1 terminado-0.17.0 tifffile-2022.10.10 tinycss2-1.2.1 tomli-2.0.1 typing-extensions-4.4.0 urllib3-1.26.12 webencodings-0.5.1 websocket-client-1.4.2 werkzeug-2.2.2 wrapt-1.14.1 zipp-3.10.0

Last step (while working in the rl_course2 environment) using Terminal: launch Jupyter...

(rl_course2) MacBook-Pro ~$ jupyter notebook

Now, in the Jupyter/Python notebook (Chrome browser):

import ray   # works!
ray.init()   # never completes (no errors)!

So I tried similar steps in the same environment using Terminal (no notebook):

(rl_course2) MacBook-Pro ~$ python3
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:48:25) 
[Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> import ray
>>> ray.init()
[no errors, but never completes]

Is there a way to fix this and run Ray normally in my Jupyter environment?

Update 1: Just now, I was able to run the simple TensorFlow test script recommended by Apple (see Get started with tensorflow-metal) using the virtual environment discussed above, and five epochs of training completed with no errors in about two minutes on an M1 Max with 64 GB of memory, so the environment appears to be working fine. Perhaps the issue involves Ray?

Solution

I have found one of possibly several answers to my question. Changing the environment.yml file (described above) slightly to import ray[rllib] rather than ray[rllib]==1.11 enabled Jupyter notebook to run ray.init() normally and execute the remainder of the code in the notebook. It appears there was a bug in ray[rllib] version 1.11 that prevented ray.init() from running on the M1 Max under some circumstances.

So to summarize: to overcome a hang involving ray.init() on Apple Silicon (M1 Max), I was able to solve it by modifying the environment.yml file to this:

name: tf-metal
channels:
  - apple
  - conda-forge
dependencies:
  - python=3.9
  - gym-all=0.21.0
  - pip
  - tensorflow-deps

  ## uncommented for use with Jupyter
  - ipykernel

  ## PyPI packages
  - pip:
    - jupyterlab
    - ray[rllib]
    - tensorflow-macos
    - tensorflow-metal

I subsequently created a Miniforge environment using the procedure described above. Python version 3.9.15 and Ray version 2.1.0 were installed in the notebook automatically, and the notebook ran normally on the M1 Max.

Answered By - hackr

[FIXED] Pytorch and ray tune: why the error; raise TuneError("Trials did not complete", incomplete_trials)?

August 23, 2022 python, pytorch, ray No comments

Issue

I want to embed hyperparameter optimisation with ray into my pytorch script.

I wrote this code (which is a reproducible example):

## Standard libraries
CHECKPOINT_PATH = "/home/ad1/new_dev_v1"
DATASET_PATH = "/home/ad1/"
import torch
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
from importlib import reload
from itertools import *
import matplotlib
from itertools import groupby
from libs_public.api import get_pantry_token
from matplotlib import pyplot as plt
from matplotlib.colors import to_rgb
from openbabel import pybel
from openbabel.pybel import readstring,descs
from operator import itemgetter
from pathlib import Path
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from ray import tune
from ray.tune import CLIReporter
from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
from sklearn import preprocessing
from sklearn.metrics import f1_score, precision_score, recall_score,roc_auc_score
from socket import TIPC_DEST_DROPPABLE
from torch.nn import Linear
from torch.utils.data import TensorDataset
from torch_geometric.data import Data, Dataset,DataLoader,DenseDataLoader,InMemoryDataset
from torch_geometric.datasets import TUDataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool
from torchmetrics.functional import precision_recall
from torchvision import transforms
from torchvision.datasets import CIFAR10
from tqdm.notebook import tqdm
import getpass, argparse
import joblib
import json
import logging
import math
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np 
import openbabel
import os
import pandas as pd
import pytorch_lightning as pl
import random
import re
import requests
import seaborn as sns
import sklearn
import sys
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
import torch_geometric
import torch_geometric.data as geom_data
import torch_geometric.nn as geom_nn
import torchmetrics
import torchvision
import warnings
matplotlib.rcParams['lines.linewidth'] = 2.0
pl.seed_everything(42)
print(device)
sns.reset_orig()
sns.set()
sys.path.append('/home/ad1/git/')
torch.backends.cudnn.deterministic = True
warnings.filterwarnings('ignore')



# Setting the seed
pl.seed_everything(42)

# Ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print(device)


import torch
from torch_geometric.datasets import TUDataset
from torch.nn import Linear
from torch_geometric.nn import global_mean_pool
from torch_geometric.data import Data, Dataset,DataLoader


from torch.utils.data import TensorDataset
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback

dataset = TUDataset(root='/tmp/MUTAG', name='MUTAG', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

train_dataset = dataset
val_dataset = dataset
test_dataset = dataset

graph_train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
graph_val_loader = DataLoader(val_dataset, batch_size=64) # Additional loader if you want to change to a larger dataset
graph_test_loader = DataLoader(test_dataset, batch_size=64)


#will change this when it makes sense
#config = {
#    "dropout": tune.uniform(0.4,0.5)
#    } 

config = {'dropout':0.4}

gnn_layer_by_name = {
    "GCN": geom_nn.GCNConv,
    "GAT": geom_nn.GATConv,
    "GraphConv": geom_nn.GraphConv
}

class GCNLayer(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.projection = nn.Linear(c_in, c_out)
        

    def forward(self, node_feats, adj_matrix):
        """
        Inputs:
            node_feats - Tensor with node features of shape [batch_size, num_nodes, c_in]
            adj_matrix - Batch of adjacency matrices of the graph. If there is an edge from i to j, adj_matrix[b,i,j]=1 else 0.
                         Supports directed edges by non-symmetric matrices. Assumes to already have added the identity connections. 
                         Shape: [batch_size, num_nodes, num_nodes]
        """
        # Num neighbours = number of incoming edges
        num_neighbours = adj_matrix.sum(dim=-1, keepdims=True)
        node_feats = self.projection(node_feats)
        node_feats = torch.bmm(adj_matrix, node_feats)
        node_feats = node_feats / num_neighbours
        return node_feats

class GNNModel(nn.Module):
    
    def __init__(self, c_in, c_hidden, c_out, num_layers=2, layer_name="GCN", dp_rate=config['dropout'], **kwargs):
        """
        Inputs:
            c_in - Dimension of input features
            c_hidden - Dimension of hidden features
            c_out - Dimension of the output features. Usually number of classes in classification
            num_layers - Number of "hidden" graph layers
            layer_name - String of the graph layer to use
            dp_rate - Dropout rate to apply throughout the network
            kwargs - Additional arguments for the graph layer (e.g. number of heads for GAT)
        """
        super().__init__()
        gnn_layer = gnn_layer_by_name[layer_name]
        
        layers = []
        in_channels, out_channels = c_in, c_hidden
        for l_idx in range(num_layers-1):
            layers += [
                gnn_layer(in_channels=in_channels, 
                          out_channels=out_channels,
                          **kwargs),
                nn.ReLU(inplace=True),
                nn.Dropout(config['dropout'])
            ]
            in_channels = c_hidden
        layers += [gnn_layer(in_channels=in_channels, 
                             out_channels=c_out,
                             **kwargs)]
        self.layers = nn.ModuleList(layers)
    
    def forward(self, x, edge_index):
        """
        Inputs:
            x - Input features per node
            edge_index - List of vertex index pairs representing the edges in the graph (PyTorch geometric notation)
        """
        for l in self.layers:
            # For graph layers, we need to add the "edge_index" tensor as additional input
            # All PyTorch Geometric graph layer inherit the class "MessagePassing", hence
            # we can simply check the class type.
            if isinstance(l, geom_nn.MessagePassing):
                x = l(x, edge_index)
            else:
                x = l(x)
        return x



class GraphGNNModel(nn.Module):
    
    def __init__(self, c_in, c_hidden, c_out, dp_rate_linear=0.5, **kwargs):
        """
        Inputs:
            c_in - Dimension of input features
            c_hidden - Dimension of hidden features
            c_out - Dimension of output features (usually number of classes)
            dp_rate_linear - Dropout rate before the linear layer (usually much higher than inside the GNN)
            kwargs - Additional arguments for the GNNModel object
        """
        super().__init__()
        self.GNN = GNNModel(c_in=c_in, 
                            c_hidden=c_hidden, 
                            c_out=c_hidden, # Not our prediction output yet!
                            **kwargs)
        self.head = nn.Sequential(
            nn.Dropout(config['dropout']),
            nn.Linear(c_hidden, c_out)
        )

    def forward(self, x, edge_index, batch_idx):
        """
        Inputs:
            x - Input features per node
            edge_index - List of vertex index pairs representing the edges in the graph (PyTorch geometric notation)
            batch_idx - Index of batch element for each node
        """
        x = self.GNN(x, edge_index)
        x = geom_nn.global_mean_pool(x, batch_idx) # Average pooling
        x = self.head(x)
        return x


#see https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html
class GraphLevelGNN(pl.LightningModule):
    
    def __init__(self, **model_kwargs):
        super().__init__()
        # Saving hyperparameters
        self.save_hyperparameters()
        
        self.model = GraphGNNModel(**model_kwargs)
        self.loss_module = nn.BCEWithLogitsLoss() #if self.hparams.c_out == 1 else nn.CrossEntropyLoss()

    def forward(self, data, mode="train"):
        x, edge_index, batch_idx = data.x, data.edge_index, data.batch
        x = self.model(x, edge_index, batch_idx)
        x = x.squeeze(dim=-1)
        
        if self.hparams.c_out == 1:
            preds = (x > 0).float()
            data.y = data.y.float()
        else:
            preds = x.argmax(dim=-1)

        loss = self.loss_module(x, data.y)
        acc = (preds == data.y).sum().float() / preds.shape[0]
        f1 = f1_score(preds,data.y)  ##change f1/precision and recall was just testing
        precision = precision_score(preds,data.y)
        recall = recall_score(preds,data.y)
        #roc_auc = roc_auc_score(preds,data.y)  ##ADD THIS BACK IN
        return loss, acc, f1,precision, recall

    def configure_optimizers(self):
        optimizer = optim.SGD(self.parameters(),lr=0.1) # High lr because of small dataset and small model
        return optimizer

    def training_step(self, batch, batch_idx):
        loss, acc, _,_, _ = self.forward(batch, mode="train")
        self.log('train_loss', loss,on_epoch=True,logger=True)
        self.log('train_acc', acc,on_epoch=True,logger=True)
        #self.log('train_precision',precision_and_recall)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, acc, _,_, _ = self.forward(batch, mode="val")
        self.log('val_acc', acc,on_epoch=True,logger=True)
        self.log('val_loss', loss,on_epoch=True,logger=True)

    def test_step(self, batch, batch_idx):
        loss, acc, f1,precision, recall = self.forward(batch, mode="test")
        self.log('test_acc', acc,on_epoch=True,logger=True)
        self.log('test_f1', f1,on_epoch=True,logger=True)
        self.log('test_precision', precision,on_epoch=True,logger=True)       
        self.log('test_recall', recall,on_epoch=True,logger=True) 
        #self.log('roc_auc', roc_auc,on_epoch=True,logger=True) 


from pytorch_lightning import loggers as pl_loggers
def train_graph_classifier(model_name, **model_kwargs):
    pl.seed_everything(42)
    
    # Create a PyTorch Lightning trainer with the generation callback
    root_dir = os.path.join(CHECKPOINT_PATH, "GraphLevel" + model_name)
    os.makedirs(root_dir, exist_ok=True)
    csv_logger = pl_loggers.CSVLogger(save_dir="logs/")

    tune_report_callback = TuneReportCheckpointCallback(
    metrics={
    "val_loss": "val_loss",
    "val_acc": "val_acc",
    },
    filename="ray_ckpt",
    on="validation_end",
    )

    trainer = pl.Trainer(default_root_dir=root_dir,
                         callbacks=[ModelCheckpoint(save_weights_only=True, mode="max", monitor="val_acc"),tune_report_callback],
                                 #   TuneReportCallback(
                                #    {
                                #        "loss": "val_loss",
                                #        "mean_accuracy": "val_accuracy" 
                                #    },
                                #        on="test_end")] # need to change this to validation but error at the minute
                                 #   ,
                         gpus=1 if str(device).startswith("cuda") else 0,
                         max_epochs=3,
                         progress_bar_refresh_rate=1,
                         logger=csv_logger,                         
                         )

    trainer.logger._default_hp_metric = None # Optional logging argument that we don't need

    # Check whether pretrained model exists. If yes, load it and skip training
    pretrained_filename = os.path.join(CHECKPOINT_PATH, f"GraphLevel{model_name}.ckpt")

    if os.path.isfile(pretrained_filename):
        print("Found pretrained model, loading...")
        model = GraphLevelGNN.load_from_checkpoint(pretrained_filename)
    else:
        pl.seed_everything(42)
        model = GraphLevelGNN(c_in = dataset.num_node_features, 
                              c_out=1, #if tu_dataset.num_classes==2 else tu_dataset.num_classes, 
                              **model_kwargs)
        trainer.fit(model, graph_train_loader, graph_val_loader)
        model = GraphLevelGNN.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)
        
    # Test best model on validation and test set
    #train_result = trainer.test(model, graph_train_loader, verbose=False)
    #test_result = trainer.test(model, graph_test_loader, verbose=False)
    #result = {"test": test_result[0]['test_acc'], "train": train_result[0]['test_acc']} 
    #return model, result
    return model

# Example of ASHA Scheduler
scheduler_asha = ASHAScheduler(
    max_t=100,
    grace_period=1,
    reduction_factor=2,
)

from ray.tune.integration.pytorch_lightning import TuneReportCallback, TuneReportCheckpointCallback

reporter = CLIReporter(
    parameter_columns=['dropout'],
    metric_columns=["val_loss", "val_acc", "training_iteration"]
)


model = train_graph_classifier(model_name="GraphConv", 
                                       c_hidden=128, 
                                       layer_name="GraphConv", 
                                       num_layers=3, 
                                       dp_rate_linear=0.5,
                                       dp_rate=0.0)


result = tune.run(
    tune.with_parameters(
        model,
        #feature_size=10,
        #target_size=2,
        epochs=50,
        gpus=0
        ),

    resources_per_trial={
        "cpu": 1,
        "gpu": 0,
    },
    

    local_dir='/home/ad1/ray_ckpt2',  # path for saving checkpoints
    metric="val_loss",
    mode="min",
    config=config,
    num_samples=16,
    scheduler=scheduler_asha,
    progress_reporter=reporter,
    name="test",
)

And the error returned is:

(tune_with_parameters pid=65319) 2022-08-17 16:28:47,053        ERROR function_runner.py:286 -- Runner Thread raised error.
(tune_with_parameters pid=65319) Traceback (most recent call last):
(tune_with_parameters pid=65319)   File "/root/miniconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 277, in run
(tune_with_parameters pid=65319)     self._entrypoint()
(tune_with_parameters pid=65319)   File "/root/miniconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 352, in entrypoint
(tune_with_parameters pid=65319)     self._status_reporter.get_checkpoint(),
(tune_with_parameters pid=65319)   File "/root/miniconda3/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(tune_with_parameters pid=65319)     return method(self, *_args, **_kwargs)
(tune_with_parameters pid=65319)   File "/root/miniconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(tune_with_parameters pid=65319)     output = fn()
(tune_with_parameters pid=65319)   File "/root/miniconda3/lib/python3.7/site-packages/ray/tune/utils/trainable.py", line 410, in inner
(tune_with_parameters pid=65319)     trainable(config, **fn_kwargs)
(tune_with_parameters pid=65319)   File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
(tune_with_parameters pid=65319)     return forward_call(*input, **kwargs)
(tune_with_parameters pid=65319) TypeError: forward() got an unexpected keyword argument 'checkpoint_dir'

Traceback (most recent call last):
  File "test_pytorch.py", line 390, in <module>
    name="test",
  File "/root/miniconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 741, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [tune_with_parameters_a90c2_00000, tune_with_parameters_a90c2_00001, 
...cut for space
tune_with_parameters_a90c2_00014, tune_with_parameters_a90c2_00015])

Could someone show me where I'm going wrong, how to I run HPO with tune in this network and then train the model with the best hyperparameters and then return the model for prediction?

Solution

Ray Tune expects a function trainable in the form of

def train_fn(config):
    # ...

In your case, it is probably best to wrap the train_graph_classifier function, e.g.

def train_fn(config):
    train_graph_classifier(
        model_name="GraphConv", 
        layer_name="GraphConv",
        **config)


analysis = tune.run(
    train_fn,
    config={
        # provide your hyperparameter search space here
        "c_hidden": tune.choice([64, 128]),
        "dp_rate_linear": tune.quniform(0.0, 1.0, 0.1),
        # ...
    },
    metric="val_loss",
    mode="min",
    # ...


print(analysis.best_checkpoint)

If you provide the TuneReportCheckpointCallback to the trainer, the analysis.best_checkpoint should contain the best model that can be then loaded for prediction, e.g.

with analysis.best_checkpoint.as_directory() as tmpdir:
    trainer = GraphLevelGNN.load_from_checkpoint(tmpdir)

Answered By - Kai

[FIXED] Replicating Ray Actors

July 16, 2022 distributed-computing, python, ray No comments

Issue

If an actor becomes a bottleneck for a Ray application, is there a way to replicate it and using the load balancing logic?

Using an actor as a service (by passing actor handles into Ray tasks), it creates use cases s.t. instead of the whole state/data belonging to an actor, and that state could be split into multiple actors in favor of the availability.

Is there a built-in tool, a common practice or a workaround for that, or does this have to be handled manually?

Solution

It turns out that the round robin logic is valid in this case. While passing replicated actors' handles to Ray tasks, actors can be distributed with round robin load balancing in mind.

E.g., if we have 150 Ray tasks (or worker actors) that will use 15 state holder actors' handles, we could basically;

StateHolderActors = [StateHolder.remote() for _ in range(15)]

futures = [task.remote(StateHolderActors[i % 15], *args) for i in range(150)]

Answered By - M.Erkin

[FIXED] Limiting CPU resources of Ray

July 16, 2022 cpu, distributed-computing, python, ray No comments

Issue

I'm trying to manage the resources of a remote machine that we use for a daily task (that uses Ray). Is it possible to limit the number of CPUs (or equivalently the number of workers) that Ray uses?

The remote machine has 16 cores. Can I limit Ray to use only 12 of them or so?

Solution

You can limit resources using:

# To start a head node.
ray start --head --num-cpus=12

# To start a non-head node.
ray start --redis-address=<redis-address> --num-cpus=12

Or via ray.init:

ray.init(num_cpus=12)

Source: Ray documentation.

Answered By - user1635327

[FIXED] How can I define the activation function as a hyperparameter in PyTorch through RAY.Tune?

July 14, 2022 hyperparameters, optimization, pytorch, ray No comments

Issue

This is the link to the main page of the topic I want.

https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html#tune-pytorch-cifar-ref

But unfortunately, there is no good documentation to answer all the questions. Also, if you know how I can define nested cross validation in this environment, please tell me.

Solution

I solved this problem as follows.

n_samples,n_features=X_train.shape
class NeuralNetwork (nn.Module):
def __init__(self,n_input_features,l1, l2,l3,config):
    super (NeuralNetwork, self).__init__()
    self.config = config
    self.linear1=nn.Linear(n_input_features,4*math.floor(n_input_features/2)+l1)
    self.linear2=nn.Linear(l1+4*math.floor(n_input_features/2),math.floor(n_input_features/2)+l2)
    self.linear3=nn.Linear(math.floor(n_input_features/2)+l2,math.floor(n_input_features/3)+l3)
    self.linear4=nn.Linear(math.floor(n_input_features/3)+l3,math.floor(n_input_features/6))
    self.linear5=nn.Linear(math.floor(n_input_features/6),1)

    self.a1 = self.config.get("a1")
    self.a2 = self.config.get("a2")
    self.a3 = self.config.get("a3")
    self.a4 = self.config.get("a4") 
@staticmethod
def activation_func(act_str):
    if act_str=="tanh" or act_str=="sigmoid":
        return eval("torch."+act_str)
    elif act_str=="silu" or act_str=="relu" or act_str=="leaky_relu" or act_str=="gelu":   
        return eval("torch.nn.functional."+act_str)
def forward(self,x):
    out=self.linear1(x)
    out=self.activation_func(self.a1)(out.float())
    out=self.linear2(out)
    out=self.activation_func(self.a2)(out.float())
    out=self.linear3(out)
    out=self.activation_func(self.a3)(out.float())
    out=self.linear4(out)
    out=self.activation_func(self.a3)(out.float())
    out=torch.sigmoid(self.linear5(out))
    y_predicted=out
    return y_predicted

Answered By - Arash Sajjadi

[FIXED] Out of memory at every second trial using Ray Tune

June 07, 2022 ray, ray-tune, tensorflow No comments

Issue

I am tuning the hyperparameters using ray tune. The model is built in the tensorflow library, it occupies a large part of the available GPU memory. I noticed that every second call reports an out of memory error.It looks like the memory is being freed, you can see in the GPU memory usage graph, this is the moment between calls of consecutive trials, between which the OOM error occurred. I add that on smaller models I do not encounter this error and the graph looks the same.

How to deal with this out of memory error in every second trial ?

Memory usage graph

Solution

There's actually a utility that helps avoid this:

https://docs.ray.io/en/master/tune/api_docs/trainable.html#ray.tune.utils.wait_for_gpu

def tune_func(config):
    tune.utils.wait_for_gpu()
    train()

tune.run(tune_func, resources_per_trial={"GPU": 1}, num_samples=10)

Answered By - richliaw

Thursday, January 18, 2024

[FIXED] Parallelize DeepFace on multiple GPUs

Issue

Solution

Monday, November 28, 2022

[FIXED] Is there a way to prevent ray.init() from hanging when using Python on Apple silicon (the M1 Max)?

Issue

Solution

Tuesday, August 23, 2022

[FIXED] Pytorch and ray tune: why the error; raise TuneError("Trials did not complete", incomplete_trials)?

Issue

Solution

Saturday, July 16, 2022

[FIXED] Replicating Ray Actors

Issue

Solution

[FIXED] Limiting CPU resources of Ray

Issue

Solution

Thursday, July 14, 2022

[FIXED] How can I define the activation function as a hyperparameter in PyTorch through RAY.Tune?

Issue

Solution

Tuesday, June 7, 2022

[FIXED] Out of memory at every second trial using Ray Tune

Issue

Solution

Popular Posts

Labels