Data Pipeline Development
Motivation
In machine learning, and in statistical data processing generally, building data clean-up pipelines is critical. However, these pipelines are hard to maintain, and it is hard to ensure that they stay reproducible. Here we give general guidelines on how to set up a data pipeline and point you to reference repositories. Our pipelines use Python as the default programming language, not only for its ease of use but also for its versatility in interacting with other computing environments.
Environment Setup
As with all our other Python projects, we recommend using pyenv and Poetry to manage the Python version, dependencies, and virtual environment. Once you set up the project, the main directory should be organized as described below.
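As a starting point, a minimal `pyproject.toml` for such a project might look like the sketch below; the project name and the dependencies listed are placeholders, not requirements of this guide:

```toml
[tool.poetry]
name = "data-pipeline"          # placeholder project name
version = "0.1.0"
description = "Statistical data clean-up pipeline"

[tool.poetry.dependencies]
python = "^3.11"                # match the interpreter pinned via pyenv, e.g. `pyenv local 3.11`
pandas = "^2.0"                 # illustrative dependency

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

With this in place, `poetry install` creates the environment and `poetry run python main.py` runs the pipeline inside it.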
Project Directory Setup
We break down the pipeline code organization into Objectives -> Studies -> Field Actions.
Since every study has fields associated with it, you can put all the field actions for a study inside that study's directory.
In the entry-point file, ensure that all the actions are applied linearly, one after another, so the order of transformations is explicit.
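A minimal sketch of such an entry point is shown below. The action names (`drop_missing`, `normalize_age`) and the record shape are hypothetical, used only to illustrate applying field actions in a fixed linear order:

```python
# main.py - hypothetical entry point: each field action is a plain
# function that takes a list of records and returns a new list.
from typing import Callable

Record = dict
Action = Callable[[list[Record]], list[Record]]

def drop_missing(records: list[Record]) -> list[Record]:
    # Field action: remove records whose "age" field is missing.
    return [r for r in records if r.get("age") is not None]

def normalize_age(records: list[Record]) -> list[Record]:
    # Field action: clamp ages to a plausible range.
    return [{**r, "age": max(0, min(120, r["age"]))} for r in records]

# The pipeline: actions are applied linearly, in list order,
# with each action's output feeding the next.
ACTIONS: list[Action] = [drop_missing, normalize_age]

def run_pipeline(records: list[Record]) -> list[Record]:
    for action in ACTIONS:
        records = action(records)
    return records

if __name__ == "__main__":
    raw = [{"age": 34}, {"age": None}, {"age": 150}]
    print(run_pipeline(raw))
```

Keeping the actions as plain functions in a single ordered list makes the pipeline easy to reorder, test in isolation, and reproduce.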
Directories such as `compute` and `cleanup` are derived from the objectives of the statistical study.
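Putting these conventions together, a hypothetical project layout might look like the following; every name here is illustrative, not prescribed:

```
data-pipeline/
├── pyproject.toml
├── main.py                    # entry point applying actions linearly
├── compute/                   # objective: derived statistics
│   └── study_a/
│       └── summary_stats.py
└── cleanup/                   # objective: data clean-up
    └── study_a/               # one directory per study
        ├── drop_missing.py    # field actions for this study
        └── normalize_age.py
```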
TODO: Add more examples for more ML like ETL pipelines