Thesis Jitse De Smet

Analyzing Security when Querying over Decentralized Environments

Jitse De Smet

How to abstract data updates in a permissioned decentralized environment behind a query abstraction layer?

Situate Thesis
Research Question and Hypothesis
The Past
The Future

Situate Thesis

Decentralization Efforts (like Solid)
- Heterogeneity of Interfaces
  (SPARQL-endpoint, LDP, ...)
- Heterogeneity of Data
  (I might have a smartwatch, you might not)
- Heterogeneity of Structure
  (I sort pictures by date, you by location)
Query Processing using SPARQL

As of right now, data on the Web is **increasingly centralized** in organizations like Google and Amazon. This centralization is at the root of privacy scandals and acts as a **vendor lock-in for consumers**. **Legislation** like GDPR opens the path for the adoption of decentralization initiatives. One of these initiatives is Solid, which is the talk of the town because the **Flemish government invests heavily** in it. Solid allows users to host their own data in what is called a Solid pod. Rooted in the Solid project is the belief that we should build upon existing Web technologies. A single Solid pod can thus be seen as a node in a decentralized graph database, where each link in a pod can be any URI. Of course, users don't want just anyone to see all their data, so we need to add permission management, leaving us with a permissioned decentralized graph database. Since different users have **different requirements** for their storage, different kinds of heterogeneity arise. In no particular order: heterogeneity of interfaces, heterogeneity of data, and heterogeneity of structure. The **sentiment of supporting heterogeneity** is what sparked the creation of the Comunica query engine, developed within this research unit.

Situate Thesis: Query Processing using SPARQL

Heterogeneity is hard for developers

Example: SPARQL Query for my selfies with Alice


# ex: is a placeholder namespace for this example
PREFIX ex: <http://example.org/>

SELECT * WHERE
{
    ?picture a ex:Picture ;
             ex:contains ex:Alice, ex:Bob ;
             ex:taken-by ex:Bob .
}

Example: SPARQL Query to add my selfie


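# ex: is a placeholder namespace for this example
PREFIX ex: <http://example.org/>

INSERT DATA
{
    # ex:selfie is a hypothetical IRI for the new picture
    ex:selfie a ex:Picture ;
        ex:contains ex:Alice, ex:Bob ;
        ex:taken-by ex:Bob .
}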

Situate Thesis: Solid Spec

Interface: LDP (RESTful)
Index:
- Type Index
- Shape Trees
Access Control:
- WAC
- ACP

Research Question and Hypothesis

"How to abstract data updates in a permissioned decentralized environment behind a query abstraction layer?"

Data consumers don't interact with the interfaces directly.

Data stores can reject the actions of data consumers.

Data stores are small, distributed, and the owner is in control.

We use a query language (think SPARQL, SQL, ...) to add the abstraction.

Research Question and Hypothesis

A query abstraction layer significantly lowers the effort a developer needs to update data in a single data store.
A query abstraction layer significantly lowers the effort a developer needs to update data in two data stores separately, where the data stores have the same interface but different structures.
A query abstraction layer significantly lowers the effort a developer needs to perform a cross-data-store update, where the data stores can have different interfaces and different structures.
An update-query engine needs only a small number (<5) of additional HTTP requests compared to manually POSTing the required resources.

The Past: step-by-step

Start by leaving the original idea
Read about querying
Think you will work on query optimization based on structural knowledge
Write shape descriptions for SolidBench
Read about it
Meet with promoter, get the "update query" hint
Read about update queries
Solidify the idea
Read some more specs
Get to work

The Past: Getting to Work

What is LDP?
Can we use Shape Trees for updates?

Example: LDP Container


@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ldp: <http://www.w3.org/ns/ldp#> .

<http://example.org/c1/>
   a ldp:BasicContainer;
   dcterms:title "A very simple container";
   ldp:contains <r1>, <r2>, <r3>.

Example: LDP Structure

pictures/
  |- Valencia/
  |  |- one.ttl
  |  |- two.ttl
  |- Ghent/
  |  |- one.ttl
  |  |- two.ttl
  |- Paris/
  |  |- one.ttl
  |  |- two.ttl
  |  |- three.ttl
  |- missing.ttl

pictures/
  |- 30-01-2024/
  |  |- one.ttl
  |  |- two.ttl
  |- 14-02-2024/
  |  |- one.ttl
  |  |- two.ttl
  |- 17-05-2023/
  |  |- one.ttl
  |  |- two.ttl
  |  |- three.ttl
  |  |- four.ttl
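
The two layouts above store the same kind of resource under different paths, so a client that writes a picture has to know which grouping a pod uses. A minimal sketch of that dependency, assuming a hypothetical `Picture` record with a city and a date, and hypothetical layout names `by-city` and `by-date` (none of these names come from LDP or Solid):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Picture:
    """Hypothetical picture metadata; the field names are assumptions."""
    name: str
    city: str
    taken_on: date


def target_path(picture: Picture, layout: str) -> str:
    """Return the path at which a picture would be stored under a given layout."""
    if layout == "by-city":
        folder = picture.city
    elif layout == "by-date":
        # Matches the DD-MM-YYYY folder names used in the example above.
        folder = picture.taken_on.strftime("%d-%m-%Y")
    else:
        raise ValueError(f"unknown layout: {layout}")
    return f"pictures/{folder}/{picture.name}.ttl"


selfie = Picture(name="one", city="Ghent", taken_on=date(2024, 2, 14))
print(target_path(selfie, "by-city"))  # pictures/Ghent/one.ttl
print(target_path(selfie, "by-date"))  # pictures/14-02-2024/one.ttl
```

The same picture ends up at two different locations, which is exactly the knowledge a query abstraction layer would have to take over from the developer.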

Example: SHACL Shape Description


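@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# ex: is a placeholder namespace for this example
@prefix ex: <http://example.org/> .

ex:PictureShape
    a sh:NodeShape ;
    sh:targetClass ex:Picture ;
    sh:property [
        sh:path ex:depicts ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path ex:contains ;
        sh:nodeKind sh:IRI ;
    ] .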

Example: Shape Trees


<#PicturesTree>
  a st:ShapeTree ;
  st:expectsType st:Container ;
  st:shape ex:PicturesShape ;
  st:contains <#PicturesByCityTree> .

<#PicturesByCityTree>
  a st:ShapeTree ;
  st:expectsType st:Container ;
  st:shape ex:PicturesByCityShape ;
  st:contains <#PictureTree> .

<#PictureTree>
  a st:ShapeTree ;
  st:expectsType st:Resource ;
  st:shape ex:PictureShape .

Is this enough?
To check that, I listed some functional requirements and user stories.
The answer: NO.

As I've mentioned before, to limit the scope of my thesis, I focus on the current tech stack of Solid. Solid uses the **LDP interface** and **adds structural information** through Shape Trees, which are used as an index. LDP provides an interface that essentially models a file system using Linked Data. Such a **file system can structure files in a variety of ways**. Shape Trees help us understand that structure. They use shape descriptions like SHACL or ShEx to describe resources. Put plainly, Shape Trees are the **natural extension of shape descriptions to LDP**. Since Shape Trees provide structural information for read queries, they might be a good starting point for discovering where we should write data. **Is this enough?**
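
To illustrate how a Shape Tree could guide a write, the following sketch models the example tree as plain Python data and looks up which chain of trees ends in a resource with a given shape. The dictionary encoding and the function `containers_for_shape` are assumptions made for illustration; they are not part of the Shape Trees specification.

```python
# Simplified encoding of the example Shape Tree:
# name -> (expected type, shape, names of contained trees).
SHAPE_TREES = {
    "PicturesTree": ("Container", "PicturesShape", ["PicturesByCityTree"]),
    "PicturesByCityTree": ("Container", "PicturesByCityShape", ["PictureTree"]),
    "PictureTree": ("Resource", "PictureShape", []),
}


def containers_for_shape(root: str, shape: str) -> list[list[str]]:
    """Return every chain of trees from `root` down to a Resource tree with `shape`."""
    expected_type, tree_shape, children = SHAPE_TREES[root]
    if expected_type == "Resource":
        return [[root]] if tree_shape == shape else []
    chains = []
    for child in children:
        for chain in containers_for_shape(child, shape):
            chains.append([root] + chain)
    return chains


print(containers_for_shape("PicturesTree", "PictureShape"))
# [['PicturesTree', 'PicturesByCityTree', 'PictureTree']]
```

The chain shows that a picture lives two containers deep, but not which second-level container to choose or how to name a new one.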

The Past: Getting to Work

What if multiple directories match?
- Do I duplicate?
- Is one canonical and the other one links to the resource saved in the canonical?
- And how do I decide which one is canonical?
What if no directories match?
How are resources grouped?
- Can I infer that the pictures-by-date example really groups by date?
- What if I need to create a new date directory?
Is that new directory I created a leaf?
- Or should I make even more directories? (Can be inferred from Shape Tree)
What to do if a resource is changed?
- Should I alter the Shape Tree?
- Should I move the resource?
- Do I have a distance metric, and do I move when the distance is too great?
Should all clients abide by the structural information?
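
The first two questions can be made concrete with a small sketch that matches a new resource against the containers of a pod and reports whether zero, one, or several containers accept it. The container paths and the matching rules below are hypothetical stand-ins for whatever structural information a pod exposes.

```python
from typing import Callable

Resource = dict[str, str]

# Hypothetical containers: path -> rule deciding whether a resource belongs there.
CONTAINERS: dict[str, Callable[[Resource], bool]] = {
    "pictures/Ghent/": lambda r: r.get("city") == "Ghent",
    "pictures/14-02-2024/": lambda r: r.get("date") == "14-02-2024",
}


def placement(resource: Resource) -> tuple[str, list[str]]:
    """Classify a write as 'none', 'unique', or 'ambiguous' and list the matching containers."""
    matches = [path for path, accepts in CONTAINERS.items() if accepts(resource)]
    if not matches:
        return "none", matches
    if len(matches) == 1:
        return "unique", matches
    return "ambiguous", matches


print(placement({"city": "Ghent", "date": "14-02-2024"}))
# ('ambiguous', ['pictures/Ghent/', 'pictures/14-02-2024/'])
print(placement({"city": "Paris", "date": "01-01-2020"}))
# ('none', [])
```

Only the `unique` outcome can be executed without further information; the `none` and `ambiguous` outcomes need an explicit policy from the pod owner.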

Introducing: Storage Guidance Vocabulary (SGV)

The Future: Overview

Adapt Comunica to allow update queries by interpreting SGV
Alter SolidBench, so we can measure
Feedback Loop: Measure and Adapt

The Future: Evaluation

Experiments using SolidBench:

Extend SolidBench with SGV descriptions
Implement manual update scripts for each structure
Reason how to generalize the different scripts
Evaluate updating a single pod using queries
Evaluate updating multiple pods using queries

The Future: Evaluation

Possible metrics:

Execution time
Number of HTTP requests
String difference between queries that request the same modification over different data stores
Ratio of queries that leave the data store inconsistent when random server failures are introduced
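
Two of these metrics are simple enough to sketch. The snippet below computes a similarity ratio between two query strings using Python's `difflib`, and the fraction of query runs that left a store inconsistent. The choice of `difflib.SequenceMatcher` and the example inputs are assumptions; no specific string-distance measure has been fixed yet.

```python
from difflib import SequenceMatcher


def query_similarity(query_a: str, query_b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two query strings are identical."""
    return SequenceMatcher(None, query_a, query_b).ratio()


def inconsistency_ratio(outcomes: list[bool]) -> float:
    """Fraction of query runs that left the data store inconsistent (True = inconsistent)."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


same = "INSERT DATA { ex:selfie a ex:Picture . }"
print(query_similarity(same, same))                      # 1.0
print(inconsistency_ratio([False, True, False, False]))  # 0.25
```

A similarity close to 1.0 across differently structured data stores would support the claim that the query layer hides the structure from the developer.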

Time for Questions