Summarization
Summarization techniques aim to condense lengthy texts into shorter, more digestible
versions while retaining the essential content. There are two primary approaches to
summarization: extractive and abstractive.
Extractive summarization involves selecting and arranging key sentences from the
original text. This method relies on statistical and machine learning techniques to
identify the most relevant sentences. Abstractive summarization, on the other hand,
generates new sentences that capture the core meaning of the input text. Abstractive
methods typically employ advanced neural network models, such as transformers, to
produce more natural-sounding summaries.
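As a rough illustration of the extractive approach described above, the following minimal sketch scores each sentence by the average corpus frequency of its non-stopword tokens and keeps the top-scoring sentences in their original order. This is only one simple statistical heuristic, not the method used by any particular system; the function names (`split_sentences`, `tokenize`, `extractive_summary`), the tiny stopword list, and the naive sentence splitter are all assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {
    "the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "that",
    "this", "on", "for", "with", "as", "was", "were", "be", "by", "at", "or",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Select the highest-scoring sentences and return them in source order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole input text.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original ordering
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    sample = ("The budget meeting covered the budget plan. "
              "Lunch was good. "
              "The team approved the budget plan for the project.")
    print(extractive_summary(sample, num_sentences=2))
```

An abstractive system would instead generate new sentences, typically with a trained neural model, so it cannot be reduced to a selection heuristic of this kind.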
Applications include summarizing news articles, research papers, and business
reports, helping users quickly digest large amounts of information. Some systems
also allow querying content with focused question-answering features.
Future work aims to improve the accuracy of abstractive models and to enable
real-time summarization, which would help manage information in journalism,
education, and professional environments.
ICSI test set
The ICSI Meeting Corpus is a collection of 75 meetings recorded at the International Computer Science Institute (ICSI) in Berkeley during the years 2000-2002 and released under the CC-BY-4.0 license. The meetings are "natural" in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. The dataset includes the English audio, as well as transcripts and summaries written by humans. The audio portion of the dataset is not used in the textual summarization task.
The test set used here is a split of 6 meetings extracted by the Meetween project partner Zoom.
A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, 2003, The ICSI Meeting Corpus, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, China.
AutoMin test set
The AutoMin dataset is the test set of the 2023 Workshop on Automatic Minuting and is released under the CC-BY-NC-SA-4.0 license. The data consists of meeting transcripts and human-written minutes, in English and Czech. Both the meetings and their reference minutes vary considerably in nature, since the data covers technical project meetings as well as parliamentary sessions.
Tirthankar Ghosal, Ondřej Bojar, Marie Hledíková, Tom Kocmi, Anna Nedoluzhko, 2023, Overview of the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023, Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, Prague, Czech Republic.