Splitting text into sentences is not trivial: this is why you should add an extra space after each sentence
Here are some existing tools for splitting text into sentences:
- OpenNLP
- Stanford CoreNLP
- GATE
- NLTK
- splitta
- LingPipe
- University of Illinois Sentence Segmentation tool
- spaCy
None of them recognizes the end of every sentence correctly.
Here is an example (from this blog):
Now you know why you should enter two spaces after each sentence: some programs rely on this.
In the computer era, spacing between sentences is handled in several different ways by various software packages. Some systems accept whatever the user types, while others attempt to alter the spacing, or use the user input as a method of detecting sentences. Computer-based word processors, and software such as TeX allow users to arrange text in a manner previously only available to professional typesetters.[71]
The text editing environment in Emacs uses a double space following a period to identify the end of sentences unambiguously; the double space convention prevents confusion with periods within sentences that signify abbreviations. How Emacs recognizes the end of a sentence is controlled by the settings sentence-end-double-space and sentence-end.[72] The vi editor also follows this convention; thus, it is relatively easy to manipulate (jump over, copy, delete) whole sentences in both emacs and vi.
Wikipedia: Sentence spacing - Digital age
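To make the double-space idea concrete, here is a minimal Python sketch. It is loosely modeled on the convention described in the quote and is not Emacs's actual sentence-end regexp. The sample text and the function names `split_naive` and `split_double_space` are made up for this illustration.

```python
import re

# Hypothetical sample: two spaces end each sentence,
# a single space follows the abbreviations.
TEXT = ("Dr. Smith met Mr. Jones at 5 p.m. in Berlin.  "
        "They talked for an hour.  It went well.")


def split_naive(text):
    """Split after '.', '!' or '?' followed by any whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_double_space(text):
    """Split only where '.', '!' or '?' is followed by two or more
    spaces, or by one or more newlines."""
    return [s for s in re.split(r"(?<=[.!?])(?: {2,}|\n+)", text) if s]


if __name__ == "__main__":
    for sentence in split_naive(TEXT):
        print("naive:       ", sentence)
    for sentence in split_double_space(TEXT):
        print("double space:", sentence)
```

The naive version cuts the first sentence into four pieces, because it treats "Dr.", "Mr." and "p.m." as sentence ends. The double-space version returns the three sentences intact, but only as long as the author really typed two spaces after each sentence.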
When you type two spaces in HTML code, the extra space is not displayed.
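This matters if you want to split text that has already been rendered. The following is a rough Python simulation of the default whitespace collapsing, not a real HTML renderer. The function name `collapse_whitespace` is made up for this illustration.

```python
import re


def collapse_whitespace(text):
    """Roughly mimic default HTML rendering: every run of spaces,
    tabs and line breaks is shown as a single space."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)


if __name__ == "__main__":
    source = "First sentence.  Second sentence."
    rendered = collapse_whitespace(source)
    print(repr(source))    # the two spaces are in the source
    print(repr(rendered))  # only one space is left after collapsing
```

So a splitter that relies on double spaces should work on the original source text, not on text copied from the rendered page.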
Recommended articles:
Testing out the NLTK sentence tokenizer
How to Split Sentences