Description
It's disheartening to look lower in the global frequency list and see how much space is wasted by OCR errors versus legitimate-but-rare words: for every hundred tokens of junk, you get one "greyish-olive" or "tetraspilus". Should we explore OCR accuracy estimation methods, so that after the top two million words or so we can start raising our standards for what counts as a token (a rough sketch of that idea is below)? We'd be able to dig deeper down the list that way, but I'm not sure whether it's a useful endeavor.
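As a minimal sketch of what "raising our standards" past a rank threshold could look like, the Python below keeps high-rank tokens untouched and applies stricter plausibility heuristics only to deeper entries. The threshold value, the heuristics, and the function names are all hypothetical illustrations, not an established OCR-accuracy estimator.

    # Hypothetical sketch: stricter token checks only past a rank cutoff.
    import re

    RANK_THRESHOLD = 2_000_000            # "top two million words or so"
    SUSPICIOUS = re.compile(r"[^a-z'-]")  # OCR junk often mixes digits/symbols into words

    def looks_plausible(token: str) -> bool:
        """Cheap plausibility heuristics for low-frequency tokens."""
        t = token.lower()
        if SUSPICIOUS.search(t):
            return False                  # digits or stray symbols inside the word
        if re.search(r"(.)\1\1", t):
            return False                  # three identical letters in a row is rarely legitimate
        if not re.search(r"[aeiouy]", t):
            return False                  # no vowel at all is usually an OCR fragment
        return True

    def filter_frequency_list(ranked_tokens):
        """Yield (token, count) pairs, applying the stricter check only past the threshold."""
        for rank, (token, count) in enumerate(ranked_tokens, start=1):
            if rank <= RANK_THRESHOLD or looks_plausible(token):
                yield token, count

    # Example: deep in the list, "gre3yish" and "xzqt" would be dropped,
    # while "greyish-olive" and "tetraspilus" would survive.

An actual OCR accuracy estimator would presumably be trained or calibrated rather than hand-written rules like these; this is only meant to show where a rank-dependent filter would slot in.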