### 12 novembre 2018

**Antoine Amarilli**(Télécom ParisTech)

We study the data management task of extracting structured

information from unstructured documents, e.g., raw text or HTML pages.

We use the framework of document spanners, where the pattern to extract

is specified declaratively by the user (as a regular expression with

capture variables) and is translated to an automaton that then is

evaluated on the document to compute the pattern matches. We focus on

the last step of this pipeline: our goal is to efficiently find the

matches of the automaton on the document, even when there can be many of

them. We do this by proposing an enumeration algorithm, which first

preprocesses the automaton and document, and then enumerates all matches

with a small delay between consecutive matches. Unlike previous work,

our algorithm achieves the best possible bounds in the input document

(namely, linear preprocessing and constant delay), while remaining

tractable in the automaton. The guiding principle of the algorithm is to

compute a factorized representation of all matches as a product of the

automaton and document, and design efficient indexes based on the

structure of this representation. We also present our ongoing follow-up

work, e.g., how to extend our algorithm to the case of tree-shaped

documents by efficiently enumerating the matches of tree automata, how

to efficiently update the enumeration results when the input document

changes, and other open research directions.