
NER with BERT-based Model: Unexpected Panic During Prediction #455

Open
@mmich-pl

Description

I am working on a university assignment that involves extracting Named Entities (NEs) from Polish text with a BERT-based model. I chose the FastPDN model from Hugging Face (clarin-pl/FastPDN) and prepared it with the utils/convert_model.py script, which converts the PyTorch weights into the rust_model.ot format that rust-bert loads.

I created a TokenClassificationConfig based on one of the examples. The config and special_tokens_map files were downloaded from Hugging Face; so was vocab.json, except that I extracted all of its keys and saved them to a txt file, one token per line (see the sketch after the config below).

    use std::path::PathBuf;

    use rust_bert::pipelines::common::{ModelResource, ModelType};
    use rust_bert::pipelines::token_classification::{
        LabelAggregationOption, TokenClassificationConfig,
    };
    use rust_bert::resources::LocalResource;

    // "My name is Jan Kowalski and I live in Wrocław."
    let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];

    let config = TokenClassificationConfig::new(
        ModelType::Bert,
        ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
        LocalResource::from(PathBuf::from(model_config_path)),
        LocalResource::from(PathBuf::from(vocab_path)),
        Some(LocalResource::from(PathBuf::from(merge_path))), // merges resource only relevant with ModelType::Roberta
        false, // lower_case
        false, // strip_accents
        None,  // add_prefix_space
        LabelAggregationOption::Mode,
    );
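
For completeness, the vocab.json-to-txt conversion looked roughly like this (a minimal sketch, not the exact code I ran; it assumes vocab.json is a flat token-to-id map, as in BERT-style vocabularies, and uses a serde_json dependency):

    use std::collections::HashMap;
    use std::fs;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let raw = fs::read_to_string("vocab.json")?;
        let vocab: HashMap<String, u64> = serde_json::from_str(&raw)?;

        // Sort tokens by id so that the line number in vocab.txt matches the
        // token id the model was trained with.
        let mut entries: Vec<(String, u64)> = vocab.into_iter().collect();
        entries.sort_by_key(|(_, id)| *id);

        let lines: Vec<String> = entries.into_iter().map(|(tok, _)| tok).collect();
        fs::write("vocab.txt", lines.join("\n"))?;
        Ok(())
    }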

Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.

    let tokenizer = BertTokenizer::from_file_with_special_token_mapping(
        vocab_path,
        false, // lower_case
        false, // strip_accents
        special_tokens,
    )?;
    println!("{:?}", tokenizer.tokenize(input[0]));

    let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
    let output = ner_model.predict_full_entities(&input);
    for entity in output {
        println!("{entity:?}");
    }

As output I got (the tokenized input, followed by an empty entity list):

["<unk>", "się", "Jan", "<unk>", "i", "<unk>", "we", "<unk>", "."]
[]

Upon switching to a tokenizer created from a tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.

    let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
    println!("{:?}", tok_opt.tokenize(input[0]));
    let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;

which now prints:

["Nazy", "wam</w>", "się</w>", "Jan</w>", "Kowalski</w>", "i</w>", "mieszkam</w>", "we</w>", "Wrocławiu</w>", ".</w>"]

But now I encountered a runtime panic during the prediction phase:

thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
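
For context, that message is the generic panic std emits whenever a slice range's start exceeds its end, so a tiny sketch reproduces it exactly; it points at an inconsistent token span computed inside the pipeline rather than anything environment-specific:

    fn main() {
        let tokens = vec![0u8; 50];
        // Panics with: slice index starts at 50 but ends at 49
        let _ = &tokens[50..49];
    }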

Environment:

  • Rust version: 1.77.2
  • PyTorch version: 2.2.0
  • tch version: v0.15.0
  • rust-bert version: local copy of the repository (current main branch)

I would be grateful if you could help.

EDIT: trying to use BertTokenizer was a complete mistake on my part, since the model apparently uses a customized tokenizer that differs slightly from base BERT's.
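
For anyone who lands here with the same symptom: a quick way to check what tokenizer a model actually ships is to read the "model"."type" field of its tokenizer.json (a minimal sketch, assuming the standard Hugging Face tokenizers file layout and a serde_json dependency):

    use std::fs;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let raw = fs::read_to_string("tokenizer.json")?;
        let json: serde_json::Value = serde_json::from_str(&raw)?;

        // Stock BERT tokenizers report "WordPiece" here; the "</w>" end-of-word
        // suffixes in the output above are typical of BPE-style vocabularies.
        println!("tokenizer model type: {}", json["model"]["type"]);
        Ok(())
    }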
