Data modeling

Welcome back, fellow space developer! We are continuing our FQL space journey in this five-part tutorial.

In this third part of the tutorial, we’re going to take a look into the principles of modeling data with Fauna.

Introduction

Fauna is a rare breed in the world of databases as it allows you to model and query your data using different paradigms:

  • Relational

  • Documents (schemaless)

  • Temporal

  • Graph-like

Having the flexibility to switch between different models means that you can avoid common pitfalls inherent in each approach.

In this part of the tutorial, we focus on documents and relational modeling techniques.

Normalization and denormalization

Document-based databases typically require that you resort to data duplication (denormalization) to be able to produce the answers needed for your application and implement certain access patterns.

Here’s a very simplistic example. Say we have millions of stored chat messages with this format:

{
  author: "Admiral Ackbar",
  message: "It's a trap!",
  timestamp: 1591475572346
}

That would make it super fast to retrieve a list of messages with the name of the author in our SpaceChat app.

But what if our users now want to update their name? We’d potentially need to perform millions of write operations. This is slow and impractical, but also very expensive since most document-based databases charge you by the number of document operations.

If you’re using a database that does not support ACID transactions, big updates could even be dangerous as there is no guarantee about the final state of the transaction.

Fauna offers ACID transactions, so we don’t have this problem. Besides, since Fauna stores each document’s complete event history, you could roll back the document to any previous state (as far back as defined by history_days on the collection, which can be set to forever/indefinite, or ttl on each document).

In relational databases, this problem is non-existent as data is commonly normalized, or in other words, each bit of data is unique across the whole database. Normalization was born out of the necessity:

  1. to save money decades ago when storage was extremely expensive, but also

  2. to help maintain consistent data and avoid the problem of denormalization that we just saw.

In a relational database, the name of the user would only exist in a single row of the Users table:

USERS
Id      Name
123     Admiral Ackbar

MESSAGES
Id      Message          authorId
23462   It's a trap!     123

Cool, so now you’d only need to update the name in a single place.

Ah, but this introduces another problem. To retrieve a message, now the database needs to read data from multiple places in storage and join that data back together before returning it to the application.

In data modeling, it’s all about tradeoffs. Sometimes, you want to design your model for performance, for querying flexibility, or for cost.

Nested data

So far, we’ve been working with single entities stored on single document objects:

  • A pilot

  • A planet

  • A spaceship

You could get pretty far with simple documents and indexes, but at some point you need to model more complex data.

It’s possible to use a single document to store multiple data entities together for one-to-one and one-to-few relationships. Let’s say we wanted to model the weapons carried by our space pilots:

{
  name: "Flash Gordon",
  weapons: [
    {
      type: "LASER_GUN",
      damage: 12
    }
  ]
}

It seems almost natural in the document-based paradigm to model hierarchical data this way, but beware. There are some important points that need to be taken into consideration and may not be obvious.

How much data are we nesting?

We know it’s unlikely that our pilots would carry more than a dozen weapons. On the other hand, imagine we wanted to model galaxies, stars, planets, etc, for a SpaceMaps app. Since each galaxy can have billions of stars, this might not be a great idea:

{
  name: "Milky Way",
  stars: [
    {
      name: "Sun",
      brightness: −26.74,
      massKg: 2000000000000000000000000000000
    },
    // etc...
  ]
}

If you attempted to list all of the stars in the Milky Way like that, your galaxy document would become huge.

Documents can be, at most, 8 MB in size. Smaller documents are better, as they consume fewer read operations and are faster to read and write. See Limits for more details.

Will the data grow?

The nesting pattern is not a great fit for use cases where the data either could grow indefinitely, or is unbound.

For example, in SpaceAdvisor, our app for reviewing space hotels and restaurants, we definitely do not want to store reviews inside the properties documents:

{
  name: "The Nebula Gourmet",
  type: "RESTAURANT",
  reviews: [
    {
      title: "Delicious",
      stars: 5,
      message: "Best filet mignon in the whole quadrant!"
    },
    // etc...
  ]
}

Again, one problem here is that documents could potentially become huge and we don’t want that. Additionally, accessing heavily nested data often requires more coding, so it should only be done in the context of your access patterns, as we’ll discuss in the next few sections.

How are we going to query the data?

Another important consideration, before nesting entities in a single document, is knowing in advance what access patterns that we are going to need on those entities. This is critical in document-based databases as your design is usually coupled to your access patterns.

For example, you might think that it would be okay to store the planets and their moons together in our SpaceMaps app.

{
  name: "Earth",
  moons: [
    {
      name: "Luna",
      massKg: 73420000000000000000000
    }
  ]
}

You might assume that this pattern is perfect for this use case. After all, it’s unlikely that a planet would have more than, say, a couple hundred moons in some extreme cases and this number would practically never change.

But what if at some point we wanted to list and sort all the moons in our SpaceMaps app? This nesting pattern wouldn’t be appropriate for this use case as we wouldn’t be able to index and query the moons properly.

How often are we going to update the data?

Finally, even if your use case perfectly fits the nesting pattern, you have to consider how often you are going to update that document. The more frequently you update a document, the less efficient this pattern becomes.

References in Fauna

Previously, we saw that each document in Fauna has a reference that identifies it with a unique document ID inside a collection.

Here is a pilot document from a previous part of the tutorial:

{
  "ref": Ref(Collection("Pilots"), "266350546751848978"),
  "ts": 1590270525630000,
  "data": {
    "name": "Flash Gordon"
  }
}

The reference to this document has a unique document ID of 266350546751848978, but remember that, by itself, the document ID is not very useful. It only makes sense when paired with a collection to create a reference to a specific document.

References to other documents

Obviously, we can also use the Reference type to reference other documents. To demonstrate this, let’s revisit our SpaceMaps app.

We already have a Planets collection from a previous part of this tutorial. For reference, here’s the document for Earth:

{
  "ref": Ref(Collection("Planets"), "267081091831038483"),
  "ts": 1590977345595000,
  "data": {
    "name": "Earth",
    "type": "TERRESTRIAL",
    "color": "BLUE"
  }
}

Now, let’s create a Moons collection:

CreateCollection({name: "Moons"})

And let’s create a moon document with a reference to a planet document:

Create(
  Collection("Moons"),
  {
    data: {
      name: "Luna",
      planetRef: Ref(Collection("Planets"), "267081091831038483")
      // NOTE: be sure to use the correct document ID for your Earth document
    }
  }
)
{
  ref: Ref(Collection("Moons"), "267691276872188416"),
  ts: 1591549145540000,
  data: {
    name: "Luna",
    planetRef: Ref(Collection("Planets"), "267081091831038483")
  }
}

We just created a one-to-many relationship since it is possible for many moons to share the same planet.

We can now retrieve all of the moons in our database using the Documents function — to avoid creating an index, like we saw in a previous part of the tutorial:

Map(
  Paginate(Documents(Collection('Moons'))),
  Lambda("moonRef", Get(Var("moonRef")))
)

We could also create an index to find all of the moons for a given planet:

CreateIndex({
  name: "all_Moons_by_planet",
  source: [Collection("Moons")],
  terms: [{ field: ["data", "planetRef"] }]
})

Then we can query the index:

Map(
  Paginate(
    Match(
      Index("all_Moons_by_planet"),
      Ref(Collection("Planets"), "267081091831038483")
    )
  ),
  Lambda("moonRef", Get(Var("moonRef")))
)
If these FQL commands for indexes are confusing, you might want to revisit the previous part of the tutorial where all these are explained in detail.

One-to-one constraints

If you’d like to enforce a one-to-one relationship, you can do that by creating an index with a unique constraint, like we saw in the previous part of the tutorial:

CreateIndex({
  name: "only_one_Moon_per_planet",
  source: [
    Collection("Moons")
  ],
  terms: [
    {field: ["data", "planetRef"]}
  ],
  unique: true
})

Now, if we try to create another moon that is related to planet Earth, we get an error:

Create(
  Collection("Moons"),
  {
    data: {
      name: "Luna 2",
      planetRef: Ref(Collection("Planets"), "267081091831038483")
    }
  }
)
error: instance not unique
document is not unique.
position: ["create"]

Arrays of references

You could also model your one-to-few relationships by using arrays of references. This could be more convenient in certain situations where the data is frequently accessed together, but you still want to be able to query the entities independently and more efficiently.

For example, we could store the moon references in our planet document this way:

{
  "name": "Earth",
  "type": "TERRESTRIAL",
  "color": "BLUE",
  "moonRefs": [
    Ref(Collection("Moons"), "267691276872188416")
  ]
}

Then, to query a planet with all its moon documents you could use the Map, Let, and Select like we learned previously when querying indexes:

Let(
  {
    planetDoc: Get(Ref(Collection("Planets"), "267081091831038483"))
  },
  {
    planet: Var("planetDoc"),
    moons: Map(
      Select(["data", "moonRefs"], Var("planetDoc")),
      Lambda("moonRef", Get(Var("moonRef")))
    )
  }
)
{
  planet: {
    ref: Ref(Collection("Planets"), "267081091831038483"),
    ts: 1591554900130000,
    data: {
      name: "Earth",
      type: "TERRESTRIAL",
      color: "BLUE",
      moonRefs: [Ref(Collection("Moons"), "267691276872188416")]
    }
  },
  moons: [
    {
      ref: Ref(Collection("Moons"), "267691276872188416"),
      ts: 1591553627340000,
      data: {
        name: "Luna",
        planetRef: Ref(Collection("Planets"), "267081091831038483")
      }
    }
  ]
}

Or, if you only wanted to get the names and ids instead of the full documents, you could do this instead:

Let(
  {
    planetDoc: Get(Ref(Collection("Planets"), "267081091831038483"))
  },
  {
    planet: Let({}, {
      id: Select(["ref","id"], Var("planetDoc")),
      name: Select(["data","name"], Var("planetDoc")),
      moons: Map(
        Select(["data", "moonRefs"], Var("planetDoc")),
        Lambda("moonRef", Let(
          {
            moonDoc: Get(Var("moonRef"))
          },
          {
            id: Select(["ref","id"], Var("moonDoc")),
            name: Select(["data","name"], Var("moonDoc"))
          }
        ))
      )
    })
  }
)
{
  planet: {
    id: "267081091831038483",
    name: "Earth",
    moons: [
      {
        id: "267691276872188416",
        name: "Luna"
      }
    ]
  }
}

This query may seem intimidating, but if you inspect it in more detail, you find that it is only using FQL functions that we’ve already learned in the previous parts of this tutorial. FQL is a lot closer to a functional programming language than a declarative language like SQL, so it might help to think about it that way.

Many-to-many relationships

As we’ve seen in previous examples, many-to-many relationships can be expressed with indexes and/or arrays of references. Let’s look at a use case for that.

There is a bit of chaos in the dock and our boss, the admiral, has tasked us with creating a system for managing spaceship repairs. Armed with our new knowledge about references and relationships in Fauna, we should be able to solve this in no time!

First, we need to be able to track our personnel:

CreateCollection({name: "DockTechnicians"})

Let’s add all of the people working in the dock:

Create(
  Collection("DockTechnicians"),
  {
    data: {name: "Johnny Sparkles"}
  }
)

// etc...

We already have a Spaceships collection from our previous parts of the tutorial. For simplicity’s sake, we’re going to assume that all ships are in the dock right now.

Now, to assign a technician to a ship, we could just maintain an array of spaceships references in the technician document:

Update(
  Ref(Collection("DockTechnicians"), "267703813461246483"),
  {
    data: {
      workingOn: [
        Ref(Collection("Spaceships"), "266356873589948946")
      ]
    }
  }
)

Wait a minute!

The admiral specifically said he not only wanted to know which technicians worked on which spaceships, but also which repairs were in process and how long each repair took.

Join collections

In the relational world, it’s very common to model many-to-many relationships with an entity using a join table, or a bridging table.

This is done in part because, with a rigid schema, you’d need to add more columns to relate a row to other rows. But there is another important reason which actually solves our problem. When modeling a relationship with an independent entity, you can assign data to that relationship.

What if we actually had a collection to track the repairs that also associated technicians with spaceships?

CreateCollection({name: "DockRepairs"})

And now, let’s create the first repair:

Create(
  Collection("DockRepairs"),
  {
    data: {
      technicianRefs: [
        Ref(Collection("DockTechnicians"), "267703813461246483")
      ],
      shipRef: Ref(Collection("Spaceships"), "266356873589948946"),
      status: 'PENDING'
    }
  }
)
{
  ref: Ref(Collection("DockRepairs"), "267705685715714560"),
  ts: 1591562886860000,
  data: {
    technicianRefs: [
      Ref(Collection("DockTechnicians"), "267703813461246483")
    ],
    shipRef: Ref(Collection("Spaceships"), "266356873589948946"),
    status: "PENDING"
  }
}

So now we know the status of a ship repair and which technicians are assigned to it. Neat.

Let’s start a repair:

Update(
  Ref(Collection("DockRepairs"), "267705685715714560"),
  {
    data: {
      startTimestamp: Time('2355-02-11T05:23:11Z'),
      status: 'IN_PROCESS'
    }
  }
)
{
  ref: Ref(Collection("DockRepairs"), "267705685715714560"),
  ts: 1591563124590000,
  data: {
    technicianRefs: [
      Ref(Collection("DockTechnicians"), "267703813461246483")
    ],
    shipRef: Ref(Collection("Spaceships"), "266356873589948946"),
    status: "IN_PROCESS",
    startTimestamp: Time("2355-02-11T05:23:11Z")
  }
}

And now let’s finish the repair:

Update(
  Ref(Collection("DockRepairs"), "267705685715714560"),
  {
    data: {
      endTimestamp: Time('2355-02-11T03:05:35Z'),
      status: 'DONE'
    }
  }
)
{
  ref: Ref(Collection("DockRepairs"), "267705685715714560"),
  ts: 1591563210950000,
  data: {
    technicianRefs: [
      Ref(Collection("DockTechnicians"), "267703813461246483")
    ],
    shipRef: Ref(Collection("Spaceships"), "266356873589948946"),
    status: "DONE",
    startTimestamp: Time("2355-02-11T05:23:11Z"),
    endTimestamp: Time("2355-02-11T03:05:35Z")
  }
}

Great!

Now with this simple index, we can see all of their repairs and their info:

CreateIndex({
  name: "all_DockRepairs",
  source: [
    Collection("DockRepairs")
  ]
})

Not bad! But how long do repairs actually take?

We could have added a duration property to our documents and stored a value when ending a repair, but where is the fun in that? With FQL, there are other ways to accomplish this.

For example, we could just determine the duration when querying our data without having to implement it in our application logic:

Map(
  Paginate(Match(Index("all_DockRepairs"))),
  Lambda("repairRef", Let({
    repairDoc: Get(Var("repairRef"))
  },{
    durationMinutes: If(
      Or(
        IsNull(Select(["data", "startTimestamp"], Var("repairDoc"), null)),
        IsNull(Select(["data", "endTimestamp"], Var("repairDoc"), null))
      ),
      null,
      TimeDiff(
        Select(["data", "endTimestamp"], Var("repairDoc")),
        Select(["data", "startTimestamp"], Var("repairDoc")),
        "minutes"
      )
    ),
    repair: Var("repairDoc")
  }))
)
{
  data: [
    {
      durationMinutes: 137,
      repair: {
        ref: Ref(Collection("DockRepairs"), "267705685715714560"),
        ts: 1591563210950000,
        data: {
          technicianRefs: [
            Ref(Collection("DockTechnicians"), "267703813461246483")
          ],
          shipRef: Ref(Collection("Spaceships"), "266356873589948946"),
          status: "DONE",
          startTimestamp: Time("2355-02-11T05:23:11Z"),
          endTimestamp: Time("2355-02-11T03:05:35Z")
        }
      }
    }
  ]
}

Whoa.

Again, this query may seem complicated, but we already know most of this query works from previous sections in this tutorial.

Here’s the new part:

If(
  Or(
    IsNull(Select(["data", "startTimestamp"], Var("repairDoc"), null)),
    IsNull(Select(["data", "endTimestamp"], Var("repairDoc"), null))
  ),
  null,
  TimeDiff(
    Select(["data", "endTimestamp"], Var("repairDoc")),
    Select(["data", "startTimestamp"], Var("repairDoc")),
    "minutes"
  )
)

What this does is check that startTimestamp or endTimestamp are not missing from the document. If both exist, then return the time difference in minutes.

Let’s go step-by-step.

  • We already know what the Select function does from previous parts of the tutorial. In this case, we are giving it a default value of null if the path ["data", "startTimestamp"] does not exist in Var("repairDoc").

  • The IsNull function returns true if a value does not exist and Select returns null.

  • The Or function returns true if either startTimestamp or endTimestamp do not exist in the document. If that’s the case, then If would return null.

  • If both timestamps do exist in the repair document, we simply calculate the duration using the TimeDiff function with the minutes unit.

Index bindings

There is another way to solve this. Do you remember index bindings from the previous part of this tutorial? These are computed values calculated beforehand.

The decision to use index bindings comes down to how often you need to know the duration of the repairs. As explained in the previous part of the tutorial, bindings consume more storage but less CPU, so you have to account for that when deciding to use bindings.

That said, here’s a possible index to return the duration using a binding and the same conditional logic. We’re also returning a custom object instead of the full document, which might make more sense when listing repairs instead of showing a detail view to your users.

CreateIndex({
  name: "all_DockRepairs_with_duration",
  source: {
    collection: Collection("DockRepairs"),
    fields: {
      durationMinutes: Query(
        Lambda("repairDoc",
          If(
            Or(
              IsNull(
                Select(["data", "startTimestamp"], Var("repairDoc"), null)
              ),
              IsNull(
                Select(["data", "endTimestamp"], Var("repairDoc"), null)
              )
            ),
            null,
            TimeDiff(
              Select(["data", "endTimestamp"], Var("repairDoc")),
              Select(["data", "startTimestamp"], Var("repairDoc")),
              "minutes"
            )
          )
        )
      )
    }
  },
  values: [
    { binding: "durationMinutes"},
    { field: ["ref", "id"]},
    { field: ["data", "status"]}
  ]
})

Let’s query it:

Map(
  Paginate(Match(Index("all_DockRepairs_with_duration"))),
  Lambda("result", Let({},
    {
      id: Select([1], Var("result")),
      status: Select([2], Var("result")),
      durationMinutes: Select([0], Var("result"))
    }
  ))
)
{
  data: [
    {
      id: "267705685715714560",
      status: "DONE",
      durationMinutes: 137
    }
  ]
}

Conclusion

So that’s it for today. Hopefully you learned something valuable!

In part 4 of the tutorial, we continue our space adventure by learning how to create FQL functions to extend the basic functionality of Fauna with custom logic.

Is this article helpful? 

Tell Fauna how the article can be improved:
Visit Fauna's forums or email docs@fauna.com

Thank you for your feedback!