author MariusArhaug <mariusarhaug@hotmail.com> 2024-04-03 15:14:33 +0200
committer MariusArhaug <mariusarhaug@hotmail.com> 2024-04-03 16:14:37 +0200
commit 80744246aff5cb9294496842ea27bf703e430c99
tree 0ef74371e57a2d3e2d6db1b8ed405646049f0ea5 /jrt_test
parent 81dd10993cdbf1053926159d45b922ebd41e32df
Add SimpleTokenScript to SimpleTokenizer
When parsing datasets such as WikiDumps into a significance model, we want to keep only characters that belong to the script of the model's language. Adding a script value to the tokenizer lets us filter on it: for example, non-Latin words can be dropped when creating an English significance model.
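
The filtering idea described above can be sketched as follows. This is only an illustration, not the code in this commit: the actual `SimpleTokenScript` API is not shown here, so the sketch uses the JDK's `Character.UnicodeScript` instead, and the names `ScriptFilterSketch`, `isInScript`, and `keepScript` are hypothetical. It assumes a token is kept only if it contains at least one letter and all of its letters belong to the target script; digits and punctuation are ignored.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the commit: filters tokens by Unicode script.
public class ScriptFilterSketch {

    // Returns true if the token has at least one letter and every letter
    // in it belongs to the given script. Non-letters are skipped.
    static boolean isInScript(String token, UnicodeScript script) {
        boolean sawLetter = false;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
            if (!Character.isLetter(cp)) {
                continue; // digits, punctuation, marks: treated as script-neutral
            }
            if (UnicodeScript.of(cp) != script) {
                return false;
            }
            sawLetter = true;
        }
        return sawLetter;
    }

    // Keeps only the tokens written in the given script, preserving order.
    static List<String> keepScript(List<String> tokens, UnicodeScript script) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (isInScript(token, script)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of(
                "hello",
                "\u043f\u0440\u0438\u0432\u0435\u0442", // Cyrillic word
                "world",
                "\u6771\u4eac",                         // Han word
                "caf\u00e9");                           // Latin word with an accent
        // Prints the three Latin tokens: hello, world, and the accented word.
        System.out.println(keepScript(tokens, UnicodeScript.LATIN));
    }
}
```

Checking each letter rather than the first character alone means mixed-script tokens are dropped as well, which may or may not match what the tokenizer in this commit does.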
Diffstat (limited to 'jrt_test')
0 files changed, 0 insertions, 0 deletions