To analyze or debug complex data processing applications, or to ensure their understandability and repeatability, provenance techniques are increasingly being deployed, resulting in large volumes and a wide variety of provenance data. The long-term goal of this project is to leverage visualization techniques to explore provenance data efficiently and effectively. In the first funding period, we focus on visualizing the full provenance data generated for a single run of a data-processing pipeline. This involves both quantitatively identifying suitable visualizations for various provenance types and ensuring user-friendly provenance data generation and visualization in existing data-processing pipelines.
What are suitable visualization techniques for different settings defined by varying types of provenance and applications?
Which metrics can quantitatively assess provenance data visualization quality?
How can such metrics support tuning processes generating and managing provenance data?
Which types of provenance are best suited to achieve the goals of reproducibility and predictability for selected visual computing processes?
Fig. 1: Visualizing and Interacting with Provenance Data
V. Bruder et al., “Volume-Based Large Dynamic Graph Analysis Supported by Evolution Provenance,” Multimedia Tools and Applications, vol. 78, no. 23, 2019, doi: 10.1007/s11042-019-07878-6.
We present an approach for the visualization and interactive analysis of dynamic graphs that contain a large number of time steps. A specific focus is put on the support of analyzing temporal aspects in the data. Central to our approach is a static, volumetric representation of the dynamic graph based on the concept of space-time cubes that we create by stacking the adjacency matrices of all time steps. The use of GPU-accelerated volume rendering techniques allows us to render this representation interactively. We identified four classes of analytics methods as being important for the analysis of large and complex graph data, which we discuss in detail: data views, aggregation and filtering, comparison, and evolution provenance. Implementations of the respective methods are presented in an integrated application, enabling interactive exploration and analysis of large graphs. We demonstrate the applicability, usefulness, and scalability of our approach by presenting two examples for analyzing dynamic graphs. Furthermore, we let visualization experts evaluate our analytics approach.
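The core construction described above, stacking per-time-step adjacency matrices into a static space-time cube, can be illustrated with a short sketch. This is not the paper's GPU implementation; the node count, time-step count, and random edge data are made up for illustration:

```python
import numpy as np

# Hypothetical parameters: n nodes, T time steps.
rng = np.random.default_rng(0)
n, T = 50, 100

# One n x n adjacency matrix per time step (here: sparse random weights).
adjacency_matrices = [
    rng.random((n, n)) * (rng.random((n, n)) < 0.05)
    for _ in range(T)
]

# Stacking along a new time axis yields a T x n x n volume that a
# GPU-accelerated volume renderer could then sample directly.
volume = np.stack(adjacency_matrices, axis=0)
assert volume.shape == (T, n, n)
```

The resulting 3D array is the "static, volumetric representation" of the dynamic graph: one axis is time, the other two are the adjacency-matrix dimensions.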
S. Oppold and M. Herschel, “Provenance for Entity Resolution,” in Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, vol. 11017, K. Belhajjame, A. Gehani, and P. Alper, Eds. Springer International Publishing, 2018, pp. 226–230.
Data provenance can support the understanding and debugging of complex data processing pipelines, which are for instance common in data integration scenarios. One task in data integration is entity resolution (ER), i.e., the identification of multiple representations of the same real-world entity. This paper focuses on provenance modeling and capture for typical ER tasks. While our definition of ER provenance is independent of the actual language or technology used to define an ER task, the method we implement as a proof of concept instruments ER rules specified in HIL, a high-level data integration language.
C. Schulz, A. Zeyfang, M. van Garderen, H. Ben Lahmar, M. Herschel, and D. Weiskopf, “Simultaneous Visual Analysis of Multiple Software Hierarchies,” in Proceedings of the IEEE Working Conference on Software Visualization (VISSOFT), 2018, pp. 87–95, doi: 10.1109/VISSOFT.2018.00017.
We propose a tree visualization technique for comparison of structures and attributes across multiple hierarchies. Many software systems are structured hierarchically by design. For example, developers subdivide source code into libraries, modules, and functions. This design propagates to software configuration and business processes, rendering software hierarchies even more important. Often these structural elements are attributed with reference counts, code quality metrics, and the like. Throughout the entire software life cycle, these hierarchies are reviewed, integrated, debugged, and changed many times by different people so that the identity of a structural element and its attributes is not clearly traceable. We argue that pairwise comparison of similar trees is a tedious task due to the lack of overview, especially when applied to a large number of hierarchies. Therefore, we strive to visualize multiple similar trees as a whole by merging them into one supertree. To merge structures and combine attributes from different trees, we leverage the Jaccard similarity and solve a matching problem while keeping track of the origin of a structure element and its attributes. Our visualization approach allows users to inspect these supertrees using node-link diagrams and indented tree plots. The nodes in these plots depict aggregated attributes and, using word-sized line plots, detailed data. We demonstrate the usefulness of our method by exploring the evolution of software repositories and debugging data processing pipelines using provenance data.
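The Jaccard-based matching step mentioned in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's matching algorithm; the module and function names are invented:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy hierarchies: module name -> set of contained function names,
# standing in for two revisions of a software hierarchy.
tree_v1 = {"io": {"read", "write"}, "core": {"parse", "eval"}}
tree_v2 = {"io": {"read", "write", "close"}, "engine": {"parse", "eval"}}

# Greedy matching: pair each v1 module with its most similar v2 module.
matches = {
    m1: max(tree_v2, key=lambda m2: jaccard(fns, tree_v2[m2]))
    for m1, fns in tree_v1.items()
}
# "core" matches "engine" despite the rename, because their contents agree.
```

In a supertree construction, matched elements would be merged into one node while the origin of each structure element and its attributes is recorded.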
H. Ben Lahmar, M. Herschel, M. Blumenschein, and D. A. Keim, “Provenance-based Visual Data Exploration with EVLIN,” in Proceedings of the Conference on Extending Database Technology (EDBT), 2018, pp. 686–689, doi: 10.5441/002/edbt.2018.85.
Tools for visual data exploration allow users to visually browse through and analyze datasets to possibly reveal interesting information hidden in the data that users are a priori unaware of. Such tools rely on both query recommendations to select data to be visualized and visualization recommendations for these data to best support users in their visual data exploration process. EVLIN (exploring visually with lineage) is a system that assists users in visually exploring relational data stored in a data warehouse. EVLIN implements novel techniques for recommending both queries and their result visualization in an integrated and interactive way. Recommendations rely on provenance (aka lineage) that describes the production process of displayed data. The demonstration of EVLIN includes an introduction to its features and functionality through sample exploration sessions. Conference attendees will then have the opportunity to gain hands-on experience of provenance-based visual data exploration by performing their own exploration sessions. These sessions will explore real-world data from several domains. While exploration sessions use a Web-based visual interface, the demonstration also features a researcher console, where attendees may have a look behind the scenes to get a more in-depth understanding of the underlying recommendation algorithms.
Visual data exploration allows users to analyze datasets based on visualizations of interesting data characteristics, to possibly discover information about the data that they are a priori unaware of. In this context, both recommendations of queries selecting the data to be visualized and recommendations of visualizations that highlight interesting data characteristics support users in visual data exploration. So far, these two types of recommendations have mostly been considered in isolation from one another.
We present a recommendation approach for visual data exploration that unifies query recommendation and visualization recommendation. The recommendations rely on two types of provenance, i.e., data provenance (aka lineage) and evolution provenance that tracks users' interactions with a data exploration system. This paper presents the provenance data model as well as the overall system architecture. We then provide details on our provenance-based recommendation algorithms. A preliminary experimental evaluation showcases the applicability of our solution in practice.
M. Herschel, R. Diestelkämper, and H. Ben Lahmar, “A Survey on Provenance - What for? What form? What from?,” The VLDB Journal, vol. 26, pp. 881–906, 2017, doi: 10.1007/s00778-017-0486-1.
Provenance refers to any information describing the production process of an end product, which can be anything from a piece of digital data to a physical object. While this survey focuses on the former type of end product, this definition still leaves room for many different interpretations of and approaches to provenance. These are typically motivated by different application domains for provenance (e.g., accountability, reproducibility, process debugging) and varying technical requirements such as runtime, scalability, or privacy. As a result, we observe a wide variety of provenance types and provenance-generating methods. This survey provides an overview of the research field of provenance, focusing on what provenance is used for (what for?), what types of provenance have been defined and captured for the different applications (what form?), and which resources and system requirements impact the choice of deploying a particular provenance solution (what from?). For each of these three key questions, we provide a classification and review the state of the art for each class. We conclude with a summary and possible future research challenges.
M. A. Baazizi, H. Ben Lahmar, D. Colazzo, G. Ghelli, and C. Sartiani, “Schema Inference for Massive JSON Datasets,” in Proceedings of the Conference on Extending Database Technology (EDBT), 2017, pp. 222–233, doi: 10.5441/002/edbt.2017.21.
In recent years, JSON has affirmed itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report about an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.
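The basic idea of inferring a type per record and merging types across a collection can be sketched as follows. This is a minimal illustration under assumed semantics (flat records, optional fields marked with "?"), not the paper's Spark-based type language or algorithm:

```python
import json

def record_type(obj: dict) -> dict:
    """Map each field of a flat JSON object to its Python type name."""
    return {k: type(v).__name__ for k, v in obj.items()}

def merge_types(types: list) -> dict:
    """Union the field types of all records; fields absent in some
    records are marked optional with a trailing '?'."""
    merged = {}
    for t in types:
        for field, ty in t.items():
            merged.setdefault(field, set()).add(ty)
    return {
        f: "|".join(sorted(tys)) + ("" if all(f in t for t in types) else "?")
        for f, tys in merged.items()
    }

docs = [json.loads(s) for s in (
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b", "tags": ["x"]}',
)]
schema = merge_types([record_type(d) for d in docs])
# schema == {"id": "int", "name": "str", "tags": "list?"}
```

A real inference algorithm would additionally recurse into nested objects and arrays and fuse structurally similar types to keep the inferred schema concise.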
R. Diestelkämper, M. Herschel, and P. Jadhav, “Provenance in DISC Systems: Reducing Space Overhead at Runtime,” in Proceedings of the USENIX Conference on Theory and Practice of Provenance (TAPP), 2017, pp. 1–13, [Online]. Available: https://dl.acm.org/doi/abs/10.5555/3183865.3183883.
Data intensive scalable computing (DISC) systems, such as Apache Hadoop or Spark, allow to process large amounts of heterogeneous data. For varying provenance applications, emerging provenance solutions for DISC systems track all source data items through each processing step, imposing a high space and time overhead during program execution. We introduce a provenance collection approach that reduces the space overhead at runtime by sampling the input data based on the definition of equivalence classes. A preliminary empirical evaluation shows that this approach allows to satisfy many use cases of provenance applications in debugging and data exploration, indicating that provenance collection for a fraction of the input data items often suffices for selected provenance applications. When additional provenance is required, we further outline a method to collect provenance at query time, reusing, when possible, partial provenance already collected during program execution.
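The sampling idea from the abstract can be illustrated with a small sketch: group input records into equivalence classes and capture full provenance only for a few representatives of each class. The key function and class size are hypothetical choices, not the paper's definition:

```python
from collections import defaultdict

def sample_by_equivalence(records, key, per_class=2):
    """Keep at most `per_class` representatives per equivalence class;
    only these would be tracked through the pipeline, reducing the
    space overhead of provenance capture at runtime."""
    classes = defaultdict(list)
    for r in records:
        classes[key(r)].append(r)
    return [r for members in classes.values() for r in members[:per_class]]

# Toy input: 5 German and 3 French records, partitioned by country.
records = [{"country": "DE", "v": i} for i in range(5)] + \
          [{"country": "FR", "v": i} for i in range(3)]
sampled = sample_by_equivalence(records, key=lambda r: r["country"])
# Two representatives per country: 4 tracked records instead of 8.
```

Records outside the sample would fall back to the outlined query-time collection when their provenance is later requested.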
M. Herschel and M. Hlawatsch, “Provenance: On and Behind the Screens,” in Proceedings of the ACM International Conference on the Management of Data (SIGMOD), 2016, pp. 2213–2217, doi: 10.1145/2882903.2912568.
Collecting and processing provenance, i.e., information describing the production process of some end product, is important in various applications, e.g., to assess quality, to ensure reproducibility, or to reinforce trust in the end product. In the past, different types of provenance meta-data have been proposed, each with a different scope. The first part of the proposed tutorial provides an overview and comparison of these different types of provenance. To put provenance to good use, it is essential to be able to interact with and present provenance data in a user-friendly way. Often, users interested in provenance are not necessarily experts in databases or query languages, as they are typically domain experts of the product and production process for which provenance is collected (biologists, journalists, etc.). Furthermore, in some scenarios, it is difficult to use solely queries for analyzing and exploring provenance data. The second part of this tutorial therefore focuses on enabling users to leverage provenance through adapted visualizations. To this end, we will present some fundamental concepts of visualization before we discuss possible visualizations for provenance.