Summarization

Summarization techniques condense lengthy texts into shorter, more digestible versions while retaining the essential content. There are two primary approaches: extractive and abstractive. Extractive summarization selects key sentences from the original text and arranges them into a summary, relying on statistical and machine learning techniques to identify the most relevant sentences. Abstractive summarization, by contrast, generates new sentences that capture the core meaning of the input text. Abstractive methods typically employ neural network models, such as transformers, to produce more natural-sounding summaries. Applications include summarizing news articles, research papers, and business reports, helping users quickly digest large amounts of information. Some systems also support focused queries over the content through question-answering features. Ongoing work aims to improve the accuracy of abstractive models and to enable real-time summarization. These capabilities are useful for managing information in journalism, education, and professional environments.
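As an illustration of the extractive approach described above, the following is a minimal sketch of a frequency-based sentence scorer. It is not the method used by any system evaluated on the test sets below; the function names (`split_sentences`, `extractive_summary`), the small stopword list, and the scoring rule (average word frequency per sentence) are all assumptions chosen for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "for", "is", "was", "were", "it", "this", "that", "with"}


def tokenize(text):
    """Lowercase the text and return its non-stopword word tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))  # word frequencies over the whole document

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```

Calling `extractive_summary(transcript, num_sentences=3)` on a meeting transcript would return three sentences copied verbatim from the input. An abstractive system, by contrast, would generate new wording and cannot be reduced to a selection step like this one.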

ICSI test set

The ICSI Meeting Corpus consists of 75 meetings recorded at the International Computer Science Institute (ICSI) in Berkeley during the years 2000-2002 and released under the CC-BY-4.0 license. The meetings are "natural" in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. The dataset includes the English audio, as well as transcripts and summaries written by humans. In the textual summarization task, the audio portion of the dataset is not used.

The test set is a split of 6 meetings extracted by the Meetween project partner Zoom.

 A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, 2003, The ICSI Meeting Corpus, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, China.

Summarization

Language en

AUTOMIN test set

The AutoMin dataset is the test set of the 2023 Workshop on Automatic Minuting and is released under the CC-BY-NC-SA-4.0 license. The data consists of meeting transcripts and human-written minutes, in English and Czech. The meetings and their reference minutes vary widely in nature, ranging from technical project meetings to parliamentary sessions.

 Tirthankar Ghosal, Ondrej Bojar, Marie Hledíková, Tom Kocmi, Anna Nedoluzhko, 2023, Overview of the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023, Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, Prague, Czech Republic.

Summarization

Language cs en