Commit ddcdf026 authored by Eric Frias

More changes to preprocessing and chunking logic:

 - better (but more expensive) filtering of HTML
 - don't compress whitespace until after sentence detection
 - use SBD for sentence detection instead of spaCy; it's much smaller
parent c255dd88
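
The ordering described in the commit message (filter HTML, detect sentences, and only then compress whitespace) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: all function names are hypothetical, the HTML filter uses only Python's standard `html.parser`, and a simple regex splitter stands in for SBD so the example has no third-party dependencies.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(raw):
    """Hypothetical HTML filter: keep visible text, leave whitespace untouched."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


def split_sentences(text):
    """Stand-in for a real sentence detector (NOT SBD): split on whitespace
    that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def compress_whitespace(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(raw):
    """Filter HTML, detect sentences on the raw spacing, then normalize."""
    text = strip_html(raw)
    sentences = split_sentences(text)
    return [compress_whitespace(s) for s in sentences]
```

For example, `preprocess("<p>Hello   world.  This is\n a test.</p>")` returns `["Hello world.", "This is a test."]`. Because whitespace is compressed last, the sentence detector sees the text with its original spacing and line breaks, which a real detector may be able to use as a boundary signal.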