n-grams greater than 2 #50

Open
@lawest59

Description

I was looking to use trigrams because there are significant three-word phrases in my corpus (e.g. "economies in transition" to refer to developing countries). I used the following code in R.

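library(wordVectors)
library(parallel)  # provides detectCores()

# Bundle frequent phrases (intended: up to three words) and write the cleaned corpus to docs.txt
statements <- prep_word2vec(basePath,
                            "docs.txt",
                            lowercase = TRUE, bundle_ngrams = 3, threshold = 50)

# Train 100-dimensional vectors on the prepared file
w2v <- train_word2vec("docs.txt",
                      output = "./stat_vecs.bin",
                      threads = detectCores(),
                      vectors = 100,
                      window = 7,
                      force = TRUE)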

It worked as expected, except that I also got some four-word phrases (e.g. "so_that_they_can"). I'm curious why this happens when bundle_ngrams is set to 3. Thanks!
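One possible explanation, offered as a sketch and not a confirmed diagnosis: as far as I understand the package documentation, bundle_ngrams = 3 runs the phrase-detection step (word2phrase) twice, and each pass joins pairs of adjacent tokens. If that is right, the second pass can join two tokens that were each bundled in the first pass, which gives a four-word phrase. The toy below illustrates only that mechanism. `merge_pass` and the hard-coded pair lists are hypothetical stand-ins; the real tool decides which pairs to join from corpus frequencies and the threshold.

```r
# Toy stand-in for one phrase-detection pass: join adjacent tokens whose
# pair appears in `pairs`. This is NOT the wordVectors implementation;
# real scoring is frequency-based, here the pairs are hard-coded.
merge_pass <- function(tokens, pairs) {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens) &&
        paste(tokens[i], tokens[i + 1]) %in% pairs) {
      # Join the pair with an underscore and skip past both tokens
      out <- c(out, paste(tokens[i], tokens[i + 1], sep = "_"))
      i <- i + 2
    } else {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

tokens <- c("so", "that", "they", "can", "vote")

# Pass 1: two ordinary bigrams are joined.
pass1 <- merge_pass(tokens, c("so that", "they can"))

# Pass 2: the two bundled tokens are now adjacent, so they can be joined too.
pass2 <- merge_pass(pass1, "so_that they_can")

print(pass1)
print(pass2)
```

Under this assumption, two passes can produce phrases of up to four words, so bundle_ngrams = 3 would not act as a strict three-word cap.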
