Combination field filtering

Complex queries that involve multiple conditions and multiple fields incur index overhead. You can avoid this overhead if you know you’re working with a moderately-sized dataset by using the FQL Filter to compute and filter the results of an initial, larger dataset.

Although using indexes is recommended as the default query strategy, not all queries benefit from indexes. Fields with low cardinality can impact overall performance. For example, a field that represents a boolean value has low cardinality.

This tutorial shows how to improve the performance of an indexed query on low cardinality fields by using the Filter function. It then extends the example, using the Range function, to cover the more complex case of a field that holds a value within a range. Next, a more real-world problem is solved for by combining index and function solutions.

Conventional approaches using indexes

Using a shopping site as an illustrative example, where the site provides filtering on the Price, Type, Brand, and Customer Rating of product items, shows that a single index that contains the four fields as terms is inefficient.

The default solution is to find where the results intersect by querying each field separately and combining the results using four indexes, one for each field, and different query statements for each pair of fields solved for. As examples, a filter that contains only values for Type and Brand has the following query:

Intersection(
   Match(Index("products_by_type"), "some-type"),
   Match(Index("products_by_brand"), "some-brand")
)

and a filter that contains values for Type and Customer Rating has the following query:

Intersection(
   Match(Index("products_by_type"), "some-type"),
   Union(
      Map(
        [3, 4, 5],
        Lambda(
          "stars"
          Match(Index("products_by_rating"), Var("stars"))
        )
      )
   )
)

Continuing the pattern, the number of possible query statements becomes 24, or 16, even if only four distinct queries, one for each field, are needed. The main issues are:

  • Four indexes are needed. If there are many fields to filter on, your index overhead grows exponentially.

  • Functions such as Intersection, Union, Difference, and Join are computationally intensive. Combining them compounds the cost.

The Filter function

If the dataset is small to moderate size, you can simplify the query by using the Filter function with the comparison operators, as shown in the following example:

Paginate(
  Filter(
    Documents(Collection("Products")),
    Lambda(
      "x",
      Let(
        {
          p: Get(Var("x"))
        },
        And(
          GTE(Select(["data", "price"], Var("p")), 100, null),
          LTE(Select(["data", "price"], Var("p")), 200, null),
          Equals(Select(["data", "brand"], Var("p")), "some-brand", null),
          Equals(Select(["data", "type"], Var("p")), "some-type", null),
          GTE(Select(["data", "customerRating"], Var("p")), "3", null)
        )
      )
    )
  )
)

The Filter function returns only those elements for which the lambda function returns true. For example, computing the Customer Rating filter becomes more efficient because you can use the GTE comparison function.

Incorporating the Paginate snippet, you can replace all 16 queries in the index example with a single, reusable, user-defined function (UDF) that handles all filter field combinations:

CreateFunction({
  name: "my-filter",
  body: Query(
    Lambda(
      "filterParams",
      Paginate(
        Filter(
          Documents(Collection("Products")),
          Lambda("x",
            Let(
              { p: Get(Var("x")) },
              And(
                If(
                  ContainsField("minPrice", Var("filterParams")),
                  GTE(
                    Select(["data", "price"], Var("p")),
                    Select(["minPrice"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("maxPrice", Var("filterParams")),
                  LTE(
                    Select(["data", "price"], Var("p")),
                    Select(["maxPrice"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("brand", Var("filterParams")),
                  Equals(
                    Select(["data", "brand"], Var("p")),
                    Select(["brand"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("type", Var("filterParams")),
                  Equals(
                    Select(["data", "type"], Var("p")),
                    Select(["type"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("customerRating", Var("filterParams")),
                  GTE(
                    Select(["data", "customerRating"], Var("p")),
                    Select(["customerRating"], Var("filterParams"))
                  ),
                  true
                )
              )
            )
          )
        )
      )
    )
  )
})

You can call your my-filter UDF repeatedly with variable filter conditions, such as by Type and Brand:

Call("my-filter", {
  type: "some-type",
  brand: "some-brand"
})

Or, query by Type and Customer Rating:

Call("my-filter", {
  type: "some-type",
  customerRating: 3
})

Combining indexes with a filter

Practically, the optimum solution for a specific query pattern often relies on a combination of approaches. Continuing with the same example but assuming a very large dataset, such as a large product catalog, paginating through the entire collection might be inefficient but you can incorporate indexes to reduce the dataset size.

An effective index is one whose terms have high cardinality. In this example, Brand might be a good index if there are many brands. Low cardinality Type and Customer Rating filter options are poor choices for indexes.

Create the index first, then use it in your new query:

CreateFunction({
  name: "my-filter",
  body: Query(
    Lambda(
      "filterParams",
      Paginate(
        Filter(
          If(
            ContainsField("brand", Var("filterParams")),
            Match(
              Index("products_by_brand"),
              Select(["brand"], Var("filterParams"))
            ),
            Documents(Collection("Products"))
          ),
          Lambda(
            "x",
            Let(
              { p: Get(Var("x")) },
              And(
                If(
                  ContainsField("minPrice", Var("filterParams")),
                  GTE(
                    Select(["data", "price"], Var("p")),
                    Select(["minPrice"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("maxPrice", Var("filterParams")),
                  LTE(
                    Select(["data", "price"], Var("p")),
                    Select(["maxPrice"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("type", Var("filterParams")),
                  Equals(
                    Select(["data", "type"], Var("p")),
                    Select(["type"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("customerRating", Var("filterParams")),
                  GTE(
                    Select(["data", "customerRating"], Var("p")),
                    Select(["customerRating"], Var("filterParams"))
                  ),
                  true
                )
              )
            )
          )
        )
      )
    )
  )
})

When the query runs, the following snippet evaluates the index and matches on Brand, returning a smaller dataset before calling the Filter function:

If(
  ContainsField("brand", Var("filterParams")),
  Match(
    Index("products_by_brand"),
    Select(["brand"], Var("filterParams"))
  ),
  Documents(Collection("Products"))
),

Possible issues with this solution include:

  • If the search wants all brands, you still need to do a full scan of the Products collection.

  • A UX change might be needed to prevent runaway queries. In the example, you might need to make the search Brand field mandatory to keep searches performant.

Using the Range function

Site filter options for querying large datasets commonly include fields that hold a value with a range. Examples are date fields and minimum and maximum price fields. The following example adds a Price field to the previous example and leverages the FQL Range function in the solution.

To compute the range on price, you need an index that declares price as a value:

CreateIndex({
  name: "products_sorted_by_price"
  source: Collection("Products"),
  values: [
    { field: ["data", "price"] },
    { field: ["ref"] }
  ]
})

You can change the UDF to include the Range function as follows:

CreateFunction({
  name: "my-filter"
  body: Query(
    Lambda(
      "filterParams",
      Paginate(
        Filter(
          Range(
            If(
              ContainsField("brand", Var("filterParams")),
              Match(
                Index("products_by_brand_sorted_by_price"),
                Select(["brand"], Var("filterParams"))
              ),
              Match(Index("products_sorted_by_price")),
            ),
            Select(["minPrice"], Var("filterParams")),
            Select(["maxPrice"], Var("filterParams"))
          ),
          Lambda(
            ["price", "x"],
            Let(
              { p: Get(Var("x")) },
              And(
                If(
                  ContainsField("type", Var("filterParams")),
                  Equals(
                    Select(["data", "type"], Var("p")),
                    Select(["type"], Var("filterParams"))
                  ),
                  true
                ),
                If(
                  ContainsField("customerRating", Var("filterParams")),
                  GTE(
                    Select(["data", "customerRating"], Var("p")),
                    Select(["customerRating"], Var("filterParams"))
                  ),
                  true
                )
              )
            )
          )
        )
      )
    )
  )
})

In this example, Price is a mandatory filter and you can see that the Range function is incorporated in the query whether or not you’re filtering by Brand. When all Brands are included, a range scan on Price is applied to the Products collection, which is more efficient than scanning the entire collection.

Summary

Consider the effectiveness of the indexes and weigh them against their incurred overhead:

  • Does an index produce low or high cardinality?

  • How many fields are filterable?

  • Is the dataset large or small?

Efficient queries combine:

  • An index on a field with high cardinality.

  • Filtering on other values that have low cardinality.

  • The Range function if you know the field holds a range of values on which to force a mandatory filter.

Is this article helpful? 

Tell Fauna how the article can be improved:
Visit Fauna's forums or email docs@fauna.com

Thank you for your feedback!