Data processing
During the preprocess_data_sources operation (called "Refresh data" in the front-end), the following steps are performed:
- data is extracted from the DATA_SOURCES described in the etl_config.cson file (sketched below),
- the pre-processing Python functions written in the augment.py file are executed according to the etl_config.PIPELINE specification,
- dataframes with load: true (either in DATA_SOURCES or PIPELINE) are loaded into MongoDB collections,
- the YouPrep pipelines described in the materialized queries are executed and their results stored.
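For illustration only, here is a hypothetical sketch of the kind of configuration the refresh reads. The real file is etl_config.cson; the sketch below expresses the same idea as a Python dict, and every domain, file and function name in it is invented for the example.

# Hypothetical sketch of an etl_config, written as a Python dict for readability
# (the real configuration lives in etl_config.cson; all names below are invented).
ETL_CONFIG = {
    "DATA_SOURCES": [
        # Extracted during "Refresh data"; load: true means the resulting
        # dataframe is written to a MongoDB collection.
        {"domain": "sales_raw", "file": "sales.csv", "load": False},
        {"domain": "stores", "file": "stores.csv", "load": True},
    ],
    "PIPELINE": [
        # Pre-processing functions from augment.py, applied in order.
        # A step can also set load: true to persist its output.
        {"function": "clean_sales", "input": ["sales_raw"], "output": "sales", "load": True},
    ],
}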
Augment
- What?
- When?
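As a minimal, hypothetical sketch (the function name, signature and wiring are assumptions, not the actual Laputa contract), a pre-processing function in augment.py could look like this, operating on pandas dataframes and referenced by an etl_config.PIPELINE step as described above:

# augment.py - hypothetical pre-processing step; it is assumed here that such a
# function receives and returns dataframes and is referenced by etl_config.PIPELINE.
import pandas as pd


def clean_sales(sales_raw: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows and normalise column names before loading."""
    df = sales_raw.dropna(subset=["store_id", "amount"])
    return df.rename(columns={"store_id": "store", "amount": "sales_amount"})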
Materialized queries (formerly prepared datasets)
A materialized query (a query with materialized=true) is a dataset computed with pandas and stored in a collection.
Specification
A materialized query is specified by its name and the query used to construct it. We call this set of parameters its specification, to distinguish it from the materialized data itself.
These specifications are stored in the queries-staging collection.
Each dataset is identified by its name and has two other properties:
- vqb_pipeline: the data transformation steps in the Visual Query Builder format,
- query: the Mongo aggregation pipeline, translated from the vqb_pipeline.
As the translator from vqb_pipeline to query belongs to the Visual Query Builder project (weaverbird, written in TypeScript), the translation cannot be performed directly in Laputa.
On the other hand, it is not possible to deduce the vqb_pipeline from the query.
That's why we need to store both.
Materialized queries can depend on one another: changing the vqb_pipeline of one can change the query of others.
Therefore, upon each modification, the queries of the whole set of materialized queries must be regenerated from their vqb_pipelines.
That's also why, when exported or imported, all the materialized query specifications are grouped in a single document, with their names as keys:
{
  "my_new_domain": {            // identifier/name of the materialized query
    "vqb_pipeline": [ ... ],    // the data transformation steps in the Visual Query Builder format
    "query": [ ... ]            // the mongo aggregation pipeline, translated from the `vqb_pipeline`
  },
  "my_other_domain": {
    "vqb_pipeline": [ ... ],
    "query": [ ... ]
  },
  ...
}
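To make the dependency between materialized queries concrete, here is a hypothetical pair of specifications in which the second query reads from the domain produced by the first. The step and field shapes are simplified for illustration and are not the exact weaverbird or Mongo output.

# Hypothetical, simplified example: "sales_by_store" depends on "clean_sales",
# so changing the vqb_pipeline of "clean_sales" forces the query of
# "sales_by_store" to be regenerated as well.
SPECIFICATIONS = {
    "clean_sales": {
        "vqb_pipeline": [
            {"name": "domain", "domain": "sales_raw"},
            {"name": "filter", "condition": {"column": "amount", "operator": "gt", "value": 0}},
        ],
        "query": [{"$match": {"domain": "sales_raw", "amount": {"$gt": 0}}}],
    },
    "sales_by_store": {
        "vqb_pipeline": [
            {"name": "domain", "domain": "clean_sales"},  # depends on the query above
            {"name": "aggregate", "on": ["store"], "aggregations": [{"column": "amount", "agg": "sum"}]},
        ],
        "query": [
            {"$match": {"domain": "clean_sales"}},
            {"$group": {"_id": "$store", "amount": {"$sum": "$amount"}}},
        ],
    },
}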
API routes
To retrieve the specification of materialized queries: GET /
Partial execution
The operation preprocess_data_sources can take several extra arguments, to restrict the scope of the refresh to certain domains only:
- input_domains: all the loaded domains that depend on these domains will be refreshed
- output_domains: these domains will be refreshed, together with all the loaded domains that must be recomputed along the way
(See laputa.common.etl.ETLPipeline::select_domains_to_load for implementation details)
Note: the same rules apply to the release_domains operation.
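As a rough, hypothetical sketch of this selection logic (the actual implementation lives in laputa.common.etl.ETLPipeline::select_domains_to_load; the function below, its name and its dependency-graph input are all invented for the example):

from typing import Dict, List, Set


def select_domains_to_refresh(
    dependencies: Dict[str, List[str]],  # domain -> domains it is computed from
    input_domains: Set[str],
    output_domains: Set[str],
) -> Set[str]:
    """Return the set of domains a partial refresh would recompute."""
    selected = set(output_domains)

    # input_domains: refresh everything that (directly or transitively)
    # depends on one of these domains.
    frontier = set(input_domains)
    while frontier:
        dependents = {d for d, deps in dependencies.items() if frontier & set(deps)}
        frontier = dependents - selected
        selected |= frontier

    # output_domains: domains recomputed while producing these outputs
    # are refreshed as well.
    frontier = set(output_domains)
    while frontier:
        upstream = {d for domain in frontier for d in dependencies.get(domain, [])}
        frontier = upstream - selected
        selected |= frontier

    return selected


# Example dependency graph: stores -> sales -> sales_by_store
deps = {"sales": ["stores"], "sales_by_store": ["sales"]}
assert select_domains_to_refresh(deps, {"stores"}, set()) == {"sales", "sales_by_store"}
assert select_domains_to_refresh(deps, set(), {"sales"}) == {"sales", "stores"}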