Query Cache

As a well-behaved internet citizen, Toucan tries to avoid repeatedly requesting data from other services.

To avoid asking other servers for the same data multiple times, we use a cache.

Configuration

The cache

The cache is configured in config.yml

app_params:
  [...]
  query_cache:
    status: enable
    min_time: 50 ms

There are 2 different configuration keys:

  • status: can be enable or disable
  • min_time: we do not want to cache each and every request, so this is the minimum amount of time a query must take before its result is put in the cache. The default value is 50 ms.

min_time units

min_time is a time delta represented as a string. Usable units are:

  • ms: milliseconds
  • s, sec, seconds: seconds
  • m, min, minutes: minutes
  • h, hours: hours
  • d, day, days: days
  • w, weeks: weeks

The cache database

The cache is stored in a Redis database. The connection to this database can also be configured in config.yml.

databases:
  [...]
  cache:
    host: redis
    port: 6379
    ssl: false
    database: 1
    default_ttl: 300
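
For illustration, assuming the standard redis-py client, such a configuration roughly maps to a connection like this (a sketch, not the actual Toucan code):

    import redis

    # Values taken from the 'cache' database configuration above.
    client = redis.Redis(
        host='redis',
        port=6379,
        ssl=False,
        db=1,  # the 'database' key
    )
    client.ping()  # raises an exception if the cache database is unreachable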

Usage

The cache should be entirely automatic, and no human action should be required.

Storing data in cache

If a query takes more than min_time (cf. configuration) to fetch data, the data will be stored in the Redis instance defined by the database named cache in config.yml.

Data will be stored at the key: {small_app_id}:{stage}:{local_or_live}:{hash(connector, datasource)}
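
Putting these two points together, the write path can be sketched roughly as follows (an illustration, not Toucan's actual code; 'client' is assumed to be a redis-py client and 'fetch' a zero-argument callable returning an already serialized result):

    import time

    def query_with_cache(client, key: str, fetch, ttl: int, min_time: float = 0.05):
        """Illustrative sketch: cache the result only if fetching it was slow enough."""
        cached = client.get(key)
        if cached is not None:
            return cached

        start = time.monotonic()
        data = fetch()
        elapsed = time.monotonic() - start

        # Only queries slower than min_time are worth caching.
        if elapsed > min_time:
            client.set(key, data, ex=ttl)
        return data

    # The key follows the layout above; the hash part is computed per connector (see below).
    # key = f'{small_app_id}:{stage}:{local_or_live}:{connector_datasource_hash}'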

The hash

The hash(connector, datasource) part is connector-dependent.

This is what is hashed:

        unique_identifier = {
            'connector': self.get_unique_identifier(),
            'permissions': permissions,
            'offset': offset,
            'limit': limit,
            'datasource': self._render_datasource(data_source)
        } 
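
This dictionary can then be serialized and hashed to produce the last part of the cache key. A minimal sketch, assuming a JSON-plus-SHA-256 approach (the actual hashing scheme may differ):

    import hashlib
    import json

    def hash_unique_identifier(unique_identifier: dict) -> str:
        # Sort keys so that equivalent dictionaries always produce the same hash.
        serialized = json.dumps(unique_identifier, sort_keys=True, default=str)
        return hashlib.sha256(serialized.encode('utf-8')).hexdigest()
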
get_unique_identifier

This method returns a unique identifier for a connector. If two connectors are instantiated from the same configuration, their identifier should be equal.

The default implementation of get_unique_identifier() returns a JSON representation of ALL the connector's data. Some of this data might be irrelevant for hashing purposes, so connectors can reimplement it.

Example: the MongoConnector contains a MongoClient instance, called client, that is used to communicate with the MongoDB server. With the default implementation, two MongoConnector instances created with the same parameters would get 2 different hashes. This is not what we want, so we explicitly exclude this client from the hash:

    def get_unique_identifier(self) -> dict:
        # 'client' holds a live MongoClient, so identical configurations would
        # otherwise hash differently; exclude it from the identifier.
        return self.json(
            exclude={'client'}
        )
_render_datasource

In Toucan, datasources are parametrized. For instance, a pseudo-datasource such as:

query: 'select * from {{ TABLE_NAME }}'
parameters:
  TABLE_NAME: foobar

It should be considered the same, for caching purposes, as:

query: 'select * from {{ TABLE_NAME }}'
parameters:
  TABLE_NAME: foobar
  useless: 4242

The 'canonical' version of such a datasource would be:

query: 'select * from foobar'

The _render_datasource method canonicalizes the datasource: it applies the parameters to the query, then drops them.

The default implementation uses nosql_apply_parameters_to_query to render the parameters, but connectors can override it. For instance, the Snowflake connector uses SqlQueryHelper.prepare_query.
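
As an illustration of the idea (a naive sketch, not the actual nosql_apply_parameters_to_query implementation), canonicalizing a datasource could look like this, assuming Jinja-style templating:

    from jinja2 import Template

    def render_datasource(datasource: dict) -> dict:
        """Apply the parameters to the query, then drop them."""
        rendered = dict(datasource)
        parameters = rendered.pop('parameters', {})
        rendered['query'] = Template(rendered['query']).render(**parameters)
        return rendered

    render_datasource({
        'query': 'select * from {{ TABLE_NAME }}',
        'parameters': {'TABLE_NAME': 'foobar', 'useless': 4242},
    })
    # -> {'query': 'select * from foobar'}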

Cache invalidation

Automatic invalidation

The following events will invalidate the cache:

  • TTL (Time To Live): each cache entry will remain valid for a given time
  • refresh data: the refresh data operation will invalidate local staging cache entries for the small app.
  • publish: the publish operation will invalidate local production cache entries for the small app.

Live data will be invalidated ONLY when the TTL has expired. Local data will be invalidated when the TTL has expired, OR when data is refreshed (staging only), OR when the small app is published (production only).
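
For illustration, given the key layout described above, refresh/publish invalidation boils down to removing keys by pattern. A sketch with redis-py (the stage names used in the example calls are assumptions):

    import redis

    def invalidate_local_cache(client: redis.Redis, small_app_id: str, stage: str) -> None:
        """Delete all 'local' cache entries of a small app for the given stage."""
        pattern = f'{small_app_id}:{stage}:local:*'
        for key in client.scan_iter(match=pattern):
            client.delete(key)

    # e.g. on 'refresh data': invalidate_local_cache(client, 'my-small-app', 'staging')
    # e.g. on 'publish':      invalidate_local_cache(client, 'my-small-app', 'production')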

Manual invalidation

If needed, it is possible to manually wipe the cache by calling the following route:

  • DELETE /query-cache to wipe the cache for all small apps

This can only be done by a super admin.
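
For example, using the Python requests library (the base URL and the authentication header below are assumptions; adapt them to your deployment):

    import requests

    # Hypothetical base URL and token; only the DELETE /query-cache route comes from this page.
    BASE_URL = 'https://toucan.example.com'
    TOKEN = '<super-admin-token>'

    response = requests.delete(
        f'{BASE_URL}/query-cache',
        headers={'Authorization': f'Bearer {TOKEN}'},
    )
    response.raise_for_status()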

TTL

The TTL can be configured in 3 different places. In order of priority (see the sketch after this list):

  • The query itself can have a TTL configured in the field cache_ttl:

domain: "connectors_velib_live"
name: "OpenDataParis"
load: false
live_data: true
parameters: {}
url: "velib-disponibilite-en-temps-reel/records?rows=100"
filter: ".records[].record.fields | {capacity: .capacity, name: .name, available_bikes: .numbikesavailable, lat: .coordonnees_geo.lat, lon: .coordonnees_geo.lon }"
cache_ttl: 1s
  • the connector can have a TTL, also in the field cache_ttl
  • finally, a default TTL can be configured in config.yml
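
A minimal sketch of that resolution order (attribute and key names other than cache_ttl and default_ttl are assumptions):

    def resolve_cache_ttl(query: dict, connector, config: dict):
        """Pick the TTL for a cache entry, by order of priority.

        Values may be time-delta strings such as '1s' (see the min_time units above).
        """
        # 1. TTL set on the query itself
        if query.get('cache_ttl') is not None:
            return query['cache_ttl']
        # 2. TTL set on the connector
        if getattr(connector, 'cache_ttl', None) is not None:
            return connector.cache_ttl
        # 3. fall back to default_ttl from the 'cache' database section of config.yml
        return config['databases']['cache']['default_ttl']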