Description
I am working on a university assignment that involves extracting Named Entities (NE) from Polish text using a BERT-based model. I chose the FastPDN model from Hugging Face (clarin-pl/FastPDN) and prepared it using the utils/convert_model.py script.
I created a TokenClassificationConfig based on one of the examples. The config and special_tokens_map files were downloaded from Hugging Face; likewise the vocab, except that the hub provides vocab.json, so I extracted all its keys and saved them to a txt file, one per line (a sketch of this conversion follows the config below).
```rust
let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];
let config = TokenClassificationConfig::new(
    ModelType::Bert,
    ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
    LocalResource::from(PathBuf::from(model_config_path)),
    LocalResource::from(PathBuf::from(vocab_path)),
    Some(LocalResource::from(PathBuf::from(merge_path))), // merges resource only relevant with ModelType::Roberta
    false, // lower_case
    false, // strip_accents
    None,  // add_prefix_space
    LabelAggregationOption::Mode,
);
```
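For completeness, the vocab.json → vocab.txt conversion mentioned above was along these lines (a minimal sketch, not the exact script I used; the function name is mine, and it assumes vocab.json maps tokens to integer ids, so the output line number has to match the id):

```rust
use std::collections::HashMap;
use std::error::Error;
use std::fs;
use std::io::Write;

// Hypothetical helper: dump a Hugging Face vocab.json ({"token": id, ...})
// into a plain-text vocab file with one token per line, ordered by id.
fn vocab_json_to_txt(json_path: &str, txt_path: &str) -> Result<(), Box<dyn Error>> {
    let raw = fs::read_to_string(json_path)?;
    let map: HashMap<String, u32> = serde_json::from_str(&raw)?;
    // Sort by id so that the line number of each token equals its id.
    let mut entries: Vec<(String, u32)> = map.into_iter().collect();
    entries.sort_by_key(|&(_, id)| id);
    let mut out = fs::File::create(txt_path)?;
    for (token, _) in entries {
        writeln!(out, "{token}")?;
    }
    Ok(())
}
```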
Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.
```rust
let tokenizer = BertTokenizer::from_file_with_special_token_mapping(
    vocab_path,
    false, // lower_case
    false, // strip_accents
    special_tokens,
)?;
println!("{:?}", tokenizer.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
let output = ner_model.predict_full_entities(&input);
for entity in output {
    println!("{entity:?}");
}
```
As output I got:

```
["<unk>", "się", "Jan", "<unk>", "i", "<unk>", "we", "<unk>", "."]
[]
```
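That nearly every content word became `<unk>` already pointed at a vocabulary mismatch; a quick way to confirm it is to look tokens up in the loaded vocab directly (a sketch, assuming the Tokenizer/Vocab traits from the rust_tokenizers crate, where tokens missing from the vocab resolve to the `<unk>` id):

```rust
use rust_tokenizers::tokenizer::Tokenizer;
use rust_tokenizers::vocab::Vocab;

// Words absent from the loaded vocab all map to the <unk> id.
let vocab = tokenizer.vocab();
for word in ["Nazywam", "Kowalski", "Wrocławiu"] {
    println!("{word} -> id {}", vocab.token_to_id(word));
}
```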
Upon switching to a tokenizer created from the tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file, available behind the hf-tokenizers feature), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.
```rust
let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
println!("{:?}", tok_opt.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;
```

This now tokenizes correctly:

```
["Nazy", "wam</w>", "się</w>", "Jan</w>", "Kowalski</w>", "i</w>", "mieszkam</w>", "we</w>", "Wrocławiu</w>", ".</w>"]
```
But now I encountered a runtime panic during the prediction phase:
```
thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
```
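For reference, this is the generic panic message Rust emits whenever a range slice has start > end, so it looks like an off-by-one in the pipeline's offset bookkeeping rather than anything model-specific. A minimal reproduction of the same message:

```rust
fn main() {
    let v = vec![0u8; 60];
    // Panics with: "slice index starts at 50 but ends at 49"
    let _ = &v[50..49];
}
```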
Environment:
- Rust version: 1.77.2
- PyTorch version: 2.2.0
- tch version: v0.15.0
- rust-bert version: local copy of the repository (current main branch)
I would be grateful if you could help.
EDIT: trying to use BertTokenizer was a complete mistake on my part; the model apparently uses a customized tokenizer that is slightly different from the base BERT one. Judging by the `</w>` end-of-word markers in the correct output, it seems to be a BPE tokenizer rather than BERT's WordPiece, which would explain whole words coming out as `<unk>`.