author MariusArhaug <mariusarhaug@hotmail.com> 2024-04-03 15:14:33 +0200
committer MariusArhaug <mariusarhaug@hotmail.com> 2024-04-03 16:14:37 +0200
commit 80744246aff5cb9294496842ea27bf703e430c99
tree 0ef74371e57a2d3e2d6db1b8ed405646049f0ea5 /CONTRIBUTING.md
parent 81dd10993cdbf1053926159d45b922ebd41e32df
Add SimpleTokenScript to SimpleTokenizer
When parsing datasets such as WikiDumps into a significance model, we want to keep only characters of the target language's script in the model. Adding the script value to the tokenizer lets us use it to filter out non-Latin words when, for example, creating an English significance model.
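The filtering idea described above can be illustrated with a minimal sketch. This is not the code from this commit: the function names `is_latin_word` and `keep_latin_words` are hypothetical, and the check is a heuristic based on Unicode character names from Python's standard `unicodedata` module rather than a true script property lookup.

```python
import unicodedata


def is_latin_word(word: str) -> bool:
    """Return True if every alphabetic character in `word` is a Latin letter.

    Assumption: a character counts as Latin when its Unicode name starts
    with "LATIN". This is a heuristic, not a real script property lookup.
    """
    letters = [ch for ch in word if ch.isalpha()]
    # Words without any letters (digits, punctuation) are rejected.
    if not letters:
        return False
    return all(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)


def keep_latin_words(words: list[str]) -> list[str]:
    """Drop every word that contains non-Latin letters."""
    return [w for w in words if is_latin_word(w)]
```

With this sketch, a token list mixing Latin, Cyrillic, and CJK words would be reduced to the Latin-script words only, which is the kind of filtering an English significance model would want.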
Diffstat (limited to 'CONTRIBUTING.md')
0 files changed, 0 insertions, 0 deletions