Description
I am working on a university assignment that involves extracting Named Entities (NE) from Polish text using a BERT-based model. I chose the FastPDN model from Hugging Face (clarin-pl/FastPDN) and prepared it using the utils/convert_model.py script.
I created a TokenClassificationConfig based on one of the examples. The config and special_tokens_map files were downloaded from Hugging Face; likewise the vocab, except that the hub provides vocab.json, so I extracted all its keys and saved them to a txt file, one per line (a sketch of this conversion follows the config below).
```rust
let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];
let config = TokenClassificationConfig::new(
    ModelType::Bert,
    ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
    LocalResource::from(PathBuf::from(model_config_path)),
    LocalResource::from(PathBuf::from(vocab_path)),
    Some(LocalResource::from(PathBuf::from(merge_path))), // merges resource only relevant with ModelType::Roberta
    false, // lower_case
    false, // strip_accents
    None,  // add_prefix_space
    LabelAggregationOption::Mode,
);
```
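For completeness, the vocab.json → vocab.txt conversion mentioned above was along these lines (a minimal sketch, not the exact script I used; the function name is mine, and it assumes vocab.json maps tokens to integer ids, so the output line number has to match the id):

```rust
use std::collections::HashMap;
use std::error::Error;
use std::fs;
use std::io::Write;

// Hypothetical helper: dump a Hugging Face vocab.json ({"token": id, ...})
// into a plain-text vocab file with one token per line, ordered by id.
fn vocab_json_to_txt(json_path: &str, txt_path: &str) -> Result<(), Box<dyn Error>> {
    let raw = fs::read_to_string(json_path)?;
    let map: HashMap<String, u32> = serde_json::from_str(&raw)?;
    // Sort by id so that the line number of each token equals its id.
    let mut entries: Vec<(String, u32)> = map.into_iter().collect();
    entries.sort_by_key(|&(_, id)| id);
    let mut out = fs::File::create(txt_path)?;
    for (token, _) in entries {
        writeln!(out, "{token}")?;
    }
    Ok(())
}
```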
Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.
```rust
let tokenizer = BertTokenizer::from_file_with_special_token_mapping(
    vocab_path,
    false, // lower_case
    false, // strip_accents
    special_tokens,
)?;
println!("{:?}", tokenizer.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
let output = ner_model.predict_full_entities(&input);
for entity in output {
    println!("{entity:?}");
}
```
As output I got:

```
["<unk>", "się", "Jan", "<unk>", "i", "<unk>", "we", "<unk>", "."]
[]
```
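That nearly every content word became `<unk>` already pointed at a vocabulary mismatch; a quick way to confirm it is to look tokens up in the loaded vocab directly (a sketch, assuming the Tokenizer/Vocab traits from the rust_tokenizers crate, where tokens missing from the vocab resolve to the `<unk>` id):

```rust
use rust_tokenizers::tokenizer::Tokenizer;
use rust_tokenizers::vocab::Vocab;

// Words absent from the loaded vocab all map to the <unk> id.
let vocab = tokenizer.vocab();
for word in ["Nazywam", "Kowalski", "Wrocławiu"] {
    println!("{word} -> id {}", vocab.token_to_id(word));
}
```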
Upon switching to a tokenizer created from the tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file, available behind the hf-tokenizers feature), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.
```rust
let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
println!("{:?}", tok_opt.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;
```

This now tokenizes correctly:

```
["Nazy", "wam</w>", "się</w>", "Jan</w>", "Kowalski</w>", "i</w>", "mieszkam</w>", "we</w>", "Wrocławiu</w>", ".</w>"]
```
But now I encountered a runtime panic during the prediction phase:
```
thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
```
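For reference, this is the generic panic message Rust emits whenever a range slice has start > end, so it looks like an off-by-one in the pipeline's offset bookkeeping rather than anything model-specific. A minimal reproduction of the same message:

```rust
fn main() {
    let v = vec![0u8; 60];
    // Panics with: "slice index starts at 50 but ends at 49"
    let _ = &v[50..49];
}
```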
Environment:
- Rust version: 1.77.2
- PyTorch version: 2.2.0
- tch version: v0.15.0
- rust-bert version: local copy of the repository (current main branch)
I would be grateful if you could help.
EDIT: trying to use BertTokenizer was a complete mistake on my part; the model apparently uses a customized tokenizer that is slightly different from the base BERT one. Judging by the `</w>` end-of-word markers in the correct output, it seems to be a BPE tokenizer rather than BERT's WordPiece, which would explain whole words coming out as `<unk>`.