November 12, 2018

Antoine Amarilli (Télécom ParisTech)

We study the data management task of extracting structured
information from unstructured documents, e.g., raw text or HTML pages.
We use the framework of document spanners, where the pattern to extract
is specified declaratively by the user (as a regular expression with
capture variables) and is translated to an automaton that then is
evaluated on the document to compute the pattern matches. We focus on
the last step of this pipeline: our goal is to efficiently find the
matches of the automaton on the document, even when there can be many of
them. We do this by proposing an enumeration algorithm, which first
preprocesses the automaton and document, and then enumerates all matches
with a small delay between consecutive matches. Unlike previous work,
our algorithm achieves the best possible bounds in the input document
(namely, linear preprocessing and constant delay), while remaining
tractable in the automaton. The guiding principle of the algorithm is to
compute a factorized representation of all matches as a product of the
automaton and document, and design efficient indexes based on the
structure of this representation. We also present our ongoing follow-up
work, e.g., how to extend our algorithm to the case of tree-shaped
documents by efficiently enumerating the matches of tree automata, how
to efficiently update the enumeration results when the input document
changes, and other open research directions.