System for exploring literary texts

Literary Exploration Machine

A web interface for exploring literary texts in Polish.
Literary Exploration Machine (LEM) provides a virtual research environment for textual scholars, allowing them to upload texts in Polish and either explore them with a suite of dedicated tools or transform them into another format (text, table, list).
The tools used include:
This project brings together existing applications developed by CLARIN-PL and supplements them with new functionalities. The main features of LEM include: lemmatisation, part-of-speech tagging, and the generation of custom wordlists and lemmatised texts. Planned features include: text clustering, semantic text classification based on machine learning with visualisation of its output, and topic modelling.
LEM is developed by CLARIN-PL (Wrocław University of Science and Technology) in collaboration with the Institute of Literary Research, Polish Academy of Sciences. The project is based on close cooperation between IT professionals, linguists and literary scholars, which ensures that the tools will suit actual researchers’ needs.
The project is under development. If you have any comments or would like to request a new feature, please contact the members of the project team.
In order to analyse your texts:
  1. Prepare a zip archive with your texts (do not use folders in the zip file). The following formats are accepted: txt, rtf, doc, docx, odt, xlsx, pdf. File size is limited. Should you need to process larger files, please contact the project team.
  2. Click on the file box and find the location of the zip file on your hard drive, or simply drag and drop it into the file box. The file will be uploaded automatically.
  3. Choose the version of the morphological analyser:
    • Morfeusz 1 - older version with a smaller register, supported by WCRFT2 to guess the part of speech of unknown words. Recommended for older texts.
    • Morfeusz 2 - provides additional features for the words being analysed (a classification of proper names and stylistic labels were added). It is also equipped with a synthesis module and a larger register containing modern words. Recommended for modern texts and electronic discourse.
  4. Choose the task:
    1. Lemmatisation (returns text files with lemmatised words)
    2. Part of Speech Tagging (returns a csv table for each text with words assigned to particular parts of speech according to the NKJP tagset)
    3. Verb characteristics (returns an xlsx table for the entire corpus with the number of tokens, and verbs divided into subgroups: infinitive; 1st, 2nd, 3rd person singular; 1st, 2nd, 3rd person plural)
    4. Lemmas and POS statistics (returns xlsx tables containing statistics on the number and percentage of particular lemmas or parts of speech in the entire corpus)
    5. Named-entity recognition (returns txt files, each containing a list of named entities occurring in a particular text)
    6. Disambiguation (returns csv files, each containing a list of words occurring in a particular text together with their synonyms derived from Słowosieć/PLWordnet)
    7. Hyperonyms & Hyponyms (returns csv files, each containing a list of words occurring in a particular text together with their hyperonyms & hyponyms derived from Słowosieć/PLWordnet)
    8. Stylometric analysis with WebSty (allows for exploration of similarity of texts with several visualisation options)
  5. Click the "Process" button to perform the analysis.
  6. Upon completion, the data will be available for download under the “Result” link.
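Step 1 asks for a zip archive that contains no folders. A minimal Python sketch of one way to prepare such an archive is shown below. The function name build_flat_zip and the example paths are hypothetical and are not part of LEM. The extension list is copied from step 1 and may not match what the service accepts at any given time.

```python
import zipfile
from pathlib import Path

# Extensions listed as accepted in step 1 (an assumption copied from the text).
ACCEPTED = {".txt", ".rtf", ".doc", ".docx", ".odt", ".xlsx", ".pdf"}


def build_flat_zip(source_dir, zip_path):
    """Pack accepted files found under source_dir into a zip with no folders.

    Files from subfolders are stored under their bare file name, so two
    files with the same name would collide; that case raises ValueError.
    Returns the list of file names written to the archive.
    """
    added = []
    seen = set()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(source_dir).rglob("*")):
            # Skip directories and files with unsupported extensions.
            if not path.is_file() or path.suffix.lower() not in ACCEPTED:
                continue
            if path.name in seen:
                raise ValueError(f"duplicate file name after flattening: {path.name}")
            seen.add(path.name)
            # arcname=path.name drops the folder part of the stored path.
            archive.write(path, arcname=path.name)
            added.append(path.name)
    return added
```

For example, build_flat_zip("my_texts", "my_texts.zip") would collect every accepted file under the hypothetical folder my_texts into a single flat archive ready for upload.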
  1. Drag in a ZIP file containing txt, doc, docx, pptx, xlsx, odt, pdf, html or rtf files (they will be converted to text automatically)
  2. Click the Process button
  3. Once the analysis is complete, a link to download the results will be displayed
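The "Lemmas and POS statistics" task reports the number and percentage of particular lemmas. As a rough illustration of that kind of calculation, the sketch below counts lemmas in an already lemmatised text. It assumes that lemmas are separated by whitespace, which may not match the actual layout of LEM output files, and the function name lemma_statistics is hypothetical. It does not reproduce how LEM itself computes its tables.

```python
from collections import Counter


def lemma_statistics(lemmatised_text):
    """Return (lemma, count, percentage) rows, most frequent lemma first.

    Assumes the input is plain text in which lemmas are separated by
    whitespace (an assumption about the file layout, not a documented one).
    """
    lemmas = lemmatised_text.split()
    total = len(lemmas)
    counts = Counter(lemmas)
    # most_common() is empty for empty input, so no division by zero occurs.
    return [
        (lemma, count, round(100 * count / total, 2))
        for lemma, count in counts.most_common()
    ]


if __name__ == "__main__":
    for lemma, count, percent in lemma_statistics("być kot być pies"):
        print(f"{lemma}\t{count}\t{percent}%")
```

Percentages here are relative to the total number of whitespace-separated tokens in the input; punctuation handling is left out on purpose to keep the sketch short.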