KaiRacters Dataset
Status as of 2024.03.22
This dataset study is one of the experiments conducted within the D-Scribes project of Prof. Dr. Isabelle Marthot-Santaniello, University of Basel. The aim of the project was the computer-assisted identification of a scribe. The dataset was used for the tests described in the article "KaiRacters: Character-level-based Writer Retrieval for Greek Papyri", by Marco Peer, Robert Sablatnig, and others (forthcoming).
GRK-120 consists of 120 documents and their parts, belonging to the 23 different scribes from the same village of Aphrodito that have been identified so far (see GRK-Papyri_120_dated in Excel and CSV, and also the visual presentation).
We chose a set of images cropped to a single writer, rather than the dataset of complete images containing the writing of multiple scribes, to be sure that the annotated letters belong to the same hand. These images were annotated at letter level in READ, version 1 (2023), installed on the server of the University of Basel.
The annotation procedure is described in:
Olga Serbaeva and Stephen White, "READ for solving manuscript riddles: a preliminary study of the manuscripts of the 3rd ṣaṭka of the Jayadrathayāmala". In: Elisa H. Barney Smith and Umapada Pal (eds.), Document Analysis and Recognition – ICDAR 2021 Workshops, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part 2. Lecture Notes in Computer Science, vol. 12917. Cham: Springer, 2021, pp. 339-348. https://link.springer.com/chapter/10.1007/978-3-030-86159-9_24
READ tutorial
Original Dataset (READ)
Where a TM consisted of multiple images, we annotated only one image per TM (that precise folio side carries the number of annotations in the Excel file).
Some items were planned for annotation but were not annotated. These have no values in the columns "n. of KAIs" and "n. of letters" in the attached overview Excel and CSV files.
Subset 1, "Letters", contains the cliplets of the 24 letters of the Greek alphabet, exported in PNG format. Only clearly legible lines were chosen for annotation. Where the resulting sample did not contain enough good letters, or did not contain all letters, further letters were added separately during the 2nd round of annotation.
The files were extracted between 2023.07.11 and 2023.09.08.
The "Letters" subset contains 9511 PNG cliplets. These were not tagged for quality and type (bt and ft tags), with a few exceptions needed for other tests. The dataset also contains 22 non-letter characters. The kappa letters written by Dios, which were needed to balance the test data for the article, were added separately in March 2024 and are included here in a separate file called "Dios".
Subset 2, "KAIs", contains annotations of KAI, one of the most frequent words and letter combinations in Greek. We annotated all KAIs on the chosen images with the help of the editions available on Papyri.info. Some images contained no KAIs. The final set contains 1300 KAIs, of which 797 are of the best quality (bt1 tags in the file names), 434 are damaged (tagged bt2), and 67 are unreadable (bt3).
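The quality counts above can be re-derived from the file names once the cliplets are downloaded. The following is a minimal sketch, assuming that the preservation tag appears in each file name as an underscore-delimited field such as "_bt2"; the function name and the sample names in the usage lines are illustrative only and are not part of the dataset.

```python
import re
from collections import Counter

# Matches an underscore-delimited preservation tag such as "_bt2".
# Assumption: the tag is followed by "_", ".", or the end of the name.
BT_TAG = re.compile(r"_bt([123])(?=_|\.|$)")


def count_preservation_tags(file_names):
    """Count file names per bt tag; names without a tag are counted as 'untagged'."""
    counts = Counter()
    for name in file_names:
        match = BT_TAG.search(name)
        counts[f"bt{match.group(1)}" if match else "untagged"] += 1
    return counts


if __name__ == "__main__":
    # Illustrative names only; in practice, pass the names of the PNG files
    # found in the unpacked "KAIs" folder.
    sample = ["κ_D18944_bt2_ft1_2.15", "κ_D18944_bt1_ft1_3.4"]
    print(count_preservation_tags(sample))
```

Run over the real folder (for example with the names returned by pathlib.Path.glob), the resulting counts can be compared with the figures given in the description.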
Both subsets were checked to ensure completeness and the absence of duplicates.
The overview files (Correspondance GRK_papyri_120, in Excel and CSV formats) contain metadata: for instance, the equivalence between the GRK image number and the TM number (the unique identifier of the text, accessible on trismegistos.org along with further metadata), the link to the transcription of the text on papyri.info, dates, and the number of annotated segments for both subsets.
Cliplets are in PNG format. Example of a file name: κ_D18944_bt2_ft1_2.15
"κ" is the letter, written in Greek;
"D18944" is D, meaning "Dioscorus dataset test", plus the TM number;
"bt1", "bt2", "bt3" give the quality of preservation, from best (bt1) to worst (bt3);
"ft1", etc. give the type of letter, tagged only as a test;
"2.15" is the line and letter number in the document, corresponding to the transcription downloaded from Papyri.info.
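The naming scheme above can be split into its fields mechanically. The following is a minimal sketch, not an official tool of the project: the regular expression, the Cliplet class and the function name are assumptions made for illustration. It also assumes that the bt and ft tags may be missing (most letter cliplets are untagged) and that the name may or may not end in ".png".

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern for names such as "κ_D18944_bt2_ft1_2.15".
# Assumptions: fields are separated by "_", the bt/ft tags are optional,
# and a trailing ".png" extension may be present.
CLIPLET_RE = re.compile(
    r"^(?P<letter>[^_]+)"
    r"_D(?P<tm>\d+)"
    r"(?:_bt(?P<bt>\d+))?"
    r"(?:_ft(?P<ft>\d+))?"
    r"_(?P<line>[^._]+)\.(?P<pos>\d+)"
    r"(?:\.png)?$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Cliplet:
    letter: str                  # the letter, written in Greek
    tm_number: int               # TM number following the "D" prefix
    preservation: Optional[int]  # 1 (best) to 3 (worst); None if untagged
    letter_type: Optional[int]   # ft tag; None if untagged
    line: str                    # line number in the transcription
    position: int                # letter number within the line


def parse_cliplet_name(name: str) -> Cliplet:
    """Split a cliplet file name into its fields, or raise ValueError."""
    match = CLIPLET_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised cliplet name: {name!r}")
    bt, ft = match.group("bt"), match.group("ft")
    return Cliplet(
        letter=match.group("letter"),
        tm_number=int(match.group("tm")),
        preservation=int(bt) if bt is not None else None,
        letter_type=int(ft) if ft is not None else None,
        line=match.group("line"),
        position=int(match.group("pos")),
    )


if __name__ == "__main__":
    print(parse_cliplet_name("κ_D18944_bt2_ft1_2.15"))
```

The line number is kept as a string because line labels in transcriptions are not always plain integers; names that do not fit the pattern raise an error rather than being silently skipped.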
"Kais" subset and kappas in "Letters" subset both have kappa (κ) in the file name and they do overlap in many cases.
Click here to download the .zip file containing the 2 subsets of cliplets, the overview files and the brief description.
The images used for annotation in .zip are forthcoming here.
Binarised Dataset (courtesy Marco Peer)
In this folder, we provide the binarized versions of the GRK-120 dataset (GRK extended all 120), the KAIs (KAIS) and the other annotated characters (Letters). The directory structure follows that of the original dataset. General information about the dataset can be found above.
Letter to page assignment
In the CSV files Kais2Page.csv and Letters2Page, we provide the mapping of the characters to the pages of the GRK-120 dataset, since the documents are mostly identified by their TM number, which is not necessarily related to the page numbers of the GRK-120 dataset. The top-left and bottom-right corner coordinates of the bounding boxes are also included.
GRK subsets
For the evaluation in our paper, we use two subsets: GRK69 for reporting the results for different letters, and GRK110 for the results for KAIs. The pages used are listed in the respective CSV files (GRK69, GRK110). Together with the letter mapping, the letters used in our paper can then be reconstructed.
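The reconstruction described above amounts to keeping only those rows of the character mapping whose page occurs in the subset file. The following is a minimal sketch under stated assumptions: the column names "page" and "file" are hypothetical, since the actual headers of the CSV files are not given here, and the rows in the usage lines are made-up placeholders, not dataset content.

```python
import csv
import io


def load_page_ids(subset_csv, page_column="page"):
    """Read the set of page identifiers from a subset file (e.g. GRK69)."""
    reader = csv.DictReader(subset_csv)
    return {row[page_column].strip() for row in reader}


def select_characters(mapping_csv, pages, page_column="page"):
    """Keep only the mapping rows whose page belongs to the given subset."""
    reader = csv.DictReader(mapping_csv)
    return [row for row in reader if row[page_column].strip() in pages]


if __name__ == "__main__":
    # Placeholder content standing in for a subset file and a mapping file.
    subset = io.StringIO("page\nP001\nP003\n")
    mapping = io.StringIO("file,page\na.png,P001\nb.png,P002\nc.png,P003\n")
    pages = load_page_ids(subset)
    for row in select_characters(mapping, pages):
        print(row["file"], row["page"])
```

With the real files, the io.StringIO objects would be replaced by files opened with open(path, newline="", encoding="utf-8"), and the page_column argument adjusted to the actual header.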
Questions
If you have questions regarding the data or the evaluation in our paper, feel free to reach out by sending an email to Marco Peer, TU Wien, Institute of Visual Computing & Human-Centered Technology, Computer Vision Lab.
Click here to download the Binarised Dataset.