author | MariusArhaug <mariusarhaug@hotmail.com> | 2024-04-03 15:14:33 +0200 |
---|---|---|
committer | MariusArhaug <mariusarhaug@hotmail.com> | 2024-04-03 16:14:37 +0200 |
commit | 80744246aff5cb9294496842ea27bf703e430c99 (patch) | |
tree | 0ef74371e57a2d3e2d6db1b8ed405646049f0ea5 /jrt_test | |
parent | 81dd10993cdbf1053926159d45b922ebd41e32df (diff) |
Add SimpleTokenScript to SimpleTokenizer
When parsing datasets such as WikiDumps into a significance model, we
want to keep only characters of that language's script in the model.
Adding the script value to the tokenizer lets us filter on it, for
example to drop non-Latin words when creating an English significance
model.
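
To illustrate the filtering described above, here is a minimal standalone sketch. It is not the code from this commit: the class and method names (`ScriptFilterSketch`, `isInScript`, `keepTokensOfScript`) are hypothetical, and it uses the JDK's `Character.UnicodeScript` directly rather than the SimpleTokenizer API, which is not shown here. It also assumes that digits and punctuation should be ignored when a token's script is decided.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep only tokens whose letters all belong to one script.
public class ScriptFilterSketch {

    /** Returns true if the token has at least one letter and all its letters are in the script. */
    static boolean isInScript(String token, Character.UnicodeScript script) {
        boolean sawLetter = false;
        for (int i = 0; i < token.length(); ) {
            int codePoint = token.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!Character.isLetter(codePoint)) {
                continue; // assumption: digits, marks and punctuation do not decide the script
            }
            sawLetter = true;
            if (Character.UnicodeScript.of(codePoint) != script) {
                return false;
            }
        }
        return sawLetter;
    }

    /** Returns the tokens that pass isInScript, in their original order. */
    static List<String> keepTokensOfScript(List<String> tokens, Character.UnicodeScript script) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (isInScript(token, script)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "hello",
                "\u043f\u0440\u0438\u0432\u0435\u0442", // Cyrillic word
                "world",
                "\u65e5\u672c\u8a9e",                   // Han characters
                "caf\u00e9");                           // Latin with an accented letter
        // Only the Latin-script tokens remain.
        System.out.println(keepTokensOfScript(tokens, Character.UnicodeScript.LATIN));
    }
}
```

Filtering per token rather than per character keeps mixed-script words out of the model entirely. Whether the tokenizer in this commit does the same is not stated in the message.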
Diffstat (limited to 'jrt_test')
0 files changed, 0 insertions, 0 deletions