Load missing keys default from argparse (#111) · rlrs/torchtitan@4042b05


Load missing keys default from argparse (pytorch#111)

```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
[rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
[rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
[rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
[rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
[rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
[rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
[rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <[email protected]>


gnadathur and gnadathur authored Mar 5, 2024

1 parent 42f8907 commit 4042b05

torchtrain/config_manager.py

@@ -29,7 +29,9 @@ def parse_args(self, args_list: list = sys.argv[1:]):
             args_dict = self._args_to_two_level_dict(args)
             if config_file is not None:
                 with open(config_file, "rb") as f:
-                    args_dict |= tomllib.load(f)
+                    for k, v in tomllib.load(f).items():
+                        # to prevent overwrite of non-specified keys
+                        args_dict[k] |= v
             for k, v in args_dict.items():
                 class_type = type(k.title(), (), v)
                 setattr(self, k, class_type())
@@ -225,7 +227,8 @@ def init_args_from_command_line(
             )
             parser.add_argument(
                 "--training.enable_selective_ac",
-                action="store_false",
+                default=False,
+                action="store_true",
                 help="whether to enable selective activation checkpointing",
             )
             return parser.parse_args(args_list)
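
The first hunk above swaps a top-level dictionary merge for a per-section merge. A minimal sketch of why that matters, using hypothetical helper names (`merge_shallow`, `merge_per_section`) and made-up section contents that are not part of the commit: with a top-level `|=`, a section loaded from the TOML file replaces the whole argparse section, so any key the file omits loses its default. The sketch also uses `setdefault` to tolerate a section that exists only in the file; that is an assumption of this example, whereas the commit's `args_dict[k] |= v` expects every TOML section to already be present in `args_dict`.

```python
def merge_shallow(defaults: dict, overrides: dict) -> dict:
    """Top-level merge: each section in `overrides` replaces the whole section."""
    merged = {section: dict(values) for section, values in defaults.items()}
    merged |= overrides  # requires Python 3.9+
    return merged


def merge_per_section(defaults: dict, overrides: dict) -> dict:
    """Per-section merge: keys missing from `overrides` keep their defaults."""
    merged = {section: dict(values) for section, values in defaults.items()}
    for section, values in overrides.items():
        # Assumption of this sketch: accept sections unknown to the parser.
        merged.setdefault(section, {}).update(values)
    return merged


# Hypothetical values standing in for argparse defaults and a TOML file.
defaults = {"training": {"steps": 10, "enable_selective_ac": False}}
from_file = {"training": {"steps": 20}}

shallow = merge_shallow(defaults, from_file)
per_section = merge_per_section(defaults, from_file)

print(shallow)      # the argparse default for enable_selective_ac is gone
print(per_section)  # the argparse default survives next to the file's value
```

In the per-section result the file's `steps` value still wins, which matches the intent stated in the diff comment: only keys the file actually specifies are overwritten.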

train_configs/debug_model.toml

@@ -38,4 +38,3 @@ checkpoint_interval = 3600
     checkpoint_interval_type = "steps"
     checkpoint_folder = ""
     dataset = "alpaca"
-    enable_selective_ac = false
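
Dropping `enable_selective_ac = false` from the TOML file only works if the parser itself defaults the flag to `False`. The following sketch, which uses a hypothetical `build_parser` helper and is not the project's actual parser, illustrates the standard `argparse` behavior behind the `store_false` to `store_true` change: a `store_false` flag defaults to `True` when it is absent, while a `store_true` flag defaults to `False`.

```python
import argparse

FLAG = "--training.enable_selective_ac"
DEST = "training.enable_selective_ac"  # argparse derives dest by stripping "--"


def build_parser(action: str) -> argparse.ArgumentParser:
    """Hypothetical helper: a parser holding only the flag under discussion."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        FLAG,
        action=action,
        help="whether to enable selective activation checkpointing",
    )
    return parser


old_parser = build_parser("store_false")  # behavior before the commit
new_parser = build_parser("store_true")   # behavior after the commit

# The dest contains a dot, so read it through vars() rather than attribute access.
old_default = vars(old_parser.parse_args([]))[DEST]
new_default = vars(new_parser.parse_args([]))[DEST]
new_enabled = vars(new_parser.parse_args([FLAG]))[DEST]

print(old_default, new_default, new_enabled)
```

The explicit `default=False` in the diff is redundant for `store_true`, but it does make the intended default visible to readers of the parser definition.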
