One consequence of today's ease of sharing information over the World Wide Web is the high availability of textual data, which is either created by social media users or made publicly available through large literary databases like Project Gutenberg. Such data provides a huge source for scientific research in many different areas, including text mining problems like web content mining or sentiment analysis, as well as recommender systems based on social media text. A still very important field, which has been discussed since the 19th century and which attempts to automatically detect the writer of a text (or information about the writer), is authorship attribution. Typical metrics used to build stylistic fingerprints include lexical features like character n-grams, word frequencies or average word/sentence lengths; syntactic features like Part-of-Speech (POS) tag frequencies; and structural features like average paragraph lengths or indentation usage.

A related problem emerges from the fact that the vast amount of available text collections makes it easier for a potential plagiarist to find fragments that can be copied. Conversely, it becomes steadily harder for detection systems to find misuse by just comparing text, and thus advanced algorithms have to be developed.

This paper gives an overview of our recent grammar-based research in the broad field of author analysis, including authorship attribution, profiling, plagiarism detection and Bible analysis. All of those applications are based purely on an analysis of the grammar syntax of authors, processed by commonly used machine learning algorithms.

While constructing sentences, an author has to adhere to the syntactic rules defined by a specific language. Nevertheless, the number of choices is large, which leads to the assumption that writers intuitively reuse preferred patterns to build their sentences. As a consequence, those patterns can be identified and utilized as a style marker.

All applications presented in this paper rely on the analysis of sentences without considering the vocabulary used. To this end, a parse tree (or syntax tree) is calculated for each sentence; it consists of structured POS tags and serves as the main processing unit to investigate the style of an author. Figure 1 shows the parse trees of the Einstein quote “Insanity: doing the same thing over and over again and expecting different results” (\(S_1\)) and a slightly modified version (\(S_2\)). It can be seen that the trees differ significantly, although the semantic meaning is the same. To quantify such differences between grammar trees, the concept of pq-grams is used.
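To make the idea of pq-grams on a POS parse tree more concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a tree is given as a nested `(label, children)` tuple, pads missing ancestors and siblings with a `*` filler, and uses `p=2`, `q=3` as illustrative defaults; the function names `pq_grams` and `pq_gram_distance` are hypothetical. The distance shown is the bag-based pq-gram distance as it is commonly defined in the literature, which may differ from how the profiles are compared here.

```python
from collections import Counter

FILLER = "*"  # assumed padding label for missing ancestors/siblings


def pq_grams(tree, p=2, q=3):
    """Return the list of pq-grams of a tree given as (label, [children])."""
    grams = []

    def visit(node, ancestors):
        label, children = node
        # Stem: the node itself preceded by its p-1 nearest ancestors.
        stem = (ancestors + (label,))[-p:]
        if not children:
            # A leaf contributes one pq-gram whose base is all filler.
            grams.append(stem + (FILLER,) * q)
            return
        # Base: a sliding window of q consecutive children.
        base = (FILLER,) * q
        for child in children:
            base = base[1:] + (child[0],)
            grams.append(stem + base)
            visit(child, stem)
        # Slide the window past the last child (q-1 trailing fillers).
        for _ in range(q - 1):
            base = base[1:] + (FILLER,)
            grams.append(stem + base)

    visit(tree, (FILLER,) * p)
    return grams


def pq_gram_distance(t1, t2, p=2, q=3):
    """Bag-based pq-gram distance in [0, 1]; 0 means identical profiles."""
    a, b = Counter(pq_grams(t1, p, q)), Counter(pq_grams(t2, p, q))
    shared = sum((a & b).values())  # multiset intersection
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total


# Two tiny, made-up POS trees (not the trees from Figure 1).
s1 = ("S", [("NP", []), ("VP", [("VB", []), ("NN", [])])])
s2 = ("S", [("VP", [("VB", []), ("NN", [])]), ("NP", [])])

profile = Counter(pq_grams(s1))
print(len(pq_grams(s1)), "pq-grams in s1")
print("distance(s1, s2) =", pq_gram_distance(s1, s2))
```

Because only POS labels enter the profile, two sentences with different vocabulary but the same syntactic structure yield the same pq-grams, which is in line with the vocabulary-independent analysis described above.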