Error when fine-tuning GatedCNN on multiple GPUs. #138

Open
@Embedding

Description

CUDA_VISIBLE_DEVICES=0,1 /dockerdata/anaconda3-2/bin/python run_classifier.py --vocab_path models/google_zh_vocab.txt \
             --config_path models/gatedcnn_9_config.json \
             --train_path datasets/chnsenticorp/train.tsv --dev_path datasets/chnsenticorp/dev.tsv --test_path datasets/chnsenticorp/test.tsv \
             --learning_rate 1e-4  --batch_size 64 --epochs_num 5 \
             --embedding word --remove_embedding_layernorm --encoder gatedcnn --pooling max
Traceback (most recent call last):
  File "run_classifier.py", line 339, in <module>
    main()
  File "run_classifier.py", line 317, in main
    loss = train_model(args, model, optimizer, scheduler, src_batch, tgt_batch, seg_batch, soft_tgt_batch)
  File "run_classifier.py", line 179, in train_model
    loss, _ = model(src_batch, tgt_batch, seg_batch, soft_tgt_batch)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "run_classifier.py", line 42, in forward
    output = self.encoder(emb, seg)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dockerdata/nlpzhezhao-14/uer_t5_4/UER-py-master/uer/encoders/cnn_encoder.py", line 61, in forward
    hidden += self.conv_b[i].repeat(1, 1, seq_length, 1)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/modules/container.py", line 426, in __getitem__
    idx = self._get_abs_string_index(idx)
  File "/dockerdata/anaconda3-2/lib/python3.7/site-packages/torch/nn/modules/container.py", line 409, in _get_abs_string_index
    raise IndexError('index {} is out of range'.format(idx))
IndexError: index 0 is out of range
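
A possible reading of the traceback, not confirmed by the report: the failing lookup `self.conv_b[i]` goes through `__getitem__` in `torch/nn/modules/container.py`, which is consistent with `conv_b` being an `nn.ParameterList`. As far as I know, some PyTorch releases do not carry the contents of an `nn.ParameterList` over into the replicas that `nn.DataParallel` creates, so the list looks empty there and index 0 is out of range. Under that assumption, one workaround is to keep all per-layer biases in a single `nn.Parameter` and index into the tensor. The class name `StackedBias`, the zero initialization, and the shapes below are hypothetical, chosen only to mirror the `repeat(1, 1, seq_length, 1)` call in the traceback; this is a sketch, not the UER-py implementation.

```python
import torch
import torch.nn as nn


class StackedBias(nn.Module):
    """Hypothetical helper: one bias per layer, stored in a single nn.Parameter.

    The parameter is registered directly on the module, so reading bias[i]
    is plain tensor indexing and does not use nn.ParameterList.__getitem__.
    """

    def __init__(self, layers_num, hidden_size):
        super().__init__()
        # Assumed shape: (layers_num, 1, hidden_size, 1, 1),
        # so self.bias[i] has shape (1, hidden_size, 1, 1).
        # Zero init is an assumption; the original init is not shown in the report.
        self.bias = nn.Parameter(torch.zeros(layers_num, 1, hidden_size, 1, 1))

    def forward(self, hidden, i):
        # hidden is assumed to be (batch, hidden_size, seq_length, 1).
        seq_length = hidden.size(2)
        # Same repeat pattern as the failing line, but on a tensor slice.
        return hidden + self.bias[i].repeat(1, 1, seq_length, 1)


# Usage on CPU with made-up sizes.
layer_bias = StackedBias(layers_num=3, hidden_size=4)
hidden = torch.zeros(2, 4, 5, 1)
out = layer_bias(hidden, 0)
print(out.shape)
```

If the assumption about `conv_b` holds, the same idea would apply inside the encoder: replace the list of parameters with one stacked parameter and leave the indexing expression unchanged. Running on a single GPU (for example `CUDA_VISIBLE_DEVICES=0`) avoids `nn.DataParallel` replication altogether and can help confirm whether replication is the trigger.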

Labels: bug
