Summarization
Summarization techniques aim to condense lengthy texts into shorter, more digestible versions while retaining the essential content.
There are two primary approaches to summarization: extractive and abstractive.
Extractive summarization involves selecting and arranging key sentences from the original text.
This method relies on statistical and machine learning techniques to identify the most relevant sentences.
Abstractive summarization, on the other hand, generates new sentences that capture the core meaning of the input text.
Abstractive methods typically employ advanced neural network models, such as transformers, to produce more natural-sounding summaries.
Applications include summarizing news articles, research papers, and business reports, helping users quickly digest large amounts of information.
Some systems also let users query the content through focused question answering.
Ongoing work aims to improve the accuracy of abstractive models and to enable real-time summarization.
Summarization is useful for managing information in journalism, education, and professional environments.
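
As an illustration of the extractive approach described above, the following is a minimal sketch that scores each sentence by the average corpus frequency of its words and keeps the top-scoring sentences in their original order. It is only one simple statistical heuristic, not the method used by any particular system. The function names, the tiny stopword list, and the naive sentence splitter are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# practical systems use much larger lists or learned weights.
STOPWORDS = {
    "the", "a", "an", "of", "to", "and", "in", "is", "are", "it",
    "on", "for", "that", "this", "with", "was",
}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole input serve as a crude relevance signal.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average rather than sum, so long sentences are not favoured by length alone.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original ordering
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    transcript = (
        "The team discussed the speech recognizer. "
        "The speech recognizer needs more training data. "
        "Lunch was pizza. "
        "The team will collect more training data for the recognizer."
    )
    print(extractive_summary(transcript, num_sentences=2))
```

Because the output consists only of sentences copied from the input, this kind of summary cannot introduce new wording. Abstractive models are not bound by that constraint.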
ICSI test set
The ICSI Meeting Corpus is a collection of 75 meetings recorded at the
International Computer Science Institute (ICSI) in Berkeley during the years
2000-2002 and released under the CC-BY-4.0 license. The meetings included
are "natural" meetings in the sense that they would have occurred anyway:
they are generally regular weekly meetings of various ICSI working teams,
including the team working on the ICSI Meeting Project. The dataset includes
the English audio, as well as transcripts and summaries written by humans.
In the textual summarization task the audio portion of the dataset is not
used.
The test set used here is a split of 6 meetings from the corpus, extracted by
the Meetween project partner Zoom.
A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, 2003,
The ICSI Meeting Corpus, Proceedings of the 2003 IEEE International
Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong
Kong, China.
AUTOMIN test set
The AutoMin dataset is the test set of the 2023 Workshop on Automatic
Minuting and is released under the CC-BY-NC-SA-4.0 license. The data consists
of meeting transcripts and human-written minutes in English and Czech. The
meetings and their reference minutes vary widely in nature, covering both
technical project meetings and parliamentary sessions.
Tirthankar Ghosal, Ondrej Bojar, Marie Hledíková, Tom Kocmi, Anna Nedoluzhko,
2023, Overview of the Second Shared Task on Automatic Minuting (AutoMin) at
INLG 2023, Proceedings of the 16th International Natural Language
Generation Conference: Generation Challenges, Prague, Czech Republic