---
title: "Exercise 1: Analysis Scripts and Snakemake"
teaching: 20
exercises: 10
questions:
- "How does one set up data analysis workflows?"
objectives:
- "Learn the basics of creating Snakemake workflows"
keypoints:
- "Snakemake allows one to run their data analyses and share them with others"
---

In this exercise we start with a ready-made analysis script that we're running locally. We'll practice using the [Snakemake](https://snakemake.github.io/) workflow management system to define data analysis pipelines.

Snakemake comes with great [documentation](https://snakemake.readthedocs.io), which you are encouraged to read. For now, let's cover its suggested use for defining ePIC benchmarks.

## Starting from an Analysis Script

We're going to go all the way from an analysis script to a fully integrated benchmark with GitLab's continuous integration (CI).

First, launch `eic-shell`:
```bash
./eic-shell
```
Then create a working directory:
```bash
mkdir tutorial_directory
mkdir tutorial_directory/starting_script
cd tutorial_directory/starting_script
```

Copy the following files to this working directory:

- this analysis script: [`uchannelrho.cxx`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/uchannelrho.cxx)
- this plotting macro: [`plot_rho_physics_benchmark.C`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/prefinal/plot_rho_physics_benchmark.C)
- this style header: [`RiceStyle.h`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/RiceStyle.h)

We will start by running over a single file from the simulation campaign. Download it to your workspace:
```bash
xrdcp root://dtn-eic.jlab.org//volatile/eic/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root ./
```

Organize the files into `analysis` and `macros` directories:
```bash
mkdir analysis
mv uchannelrho.cxx analysis/
mkdir macros
mv plot_rho_physics_benchmark.C macros/
mv RiceStyle.h macros/
```

Run the analysis script over the simulation campaign output:
```bash
root -l -b -q 'analysis/uchannelrho.cxx+("rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root","output.root")'
```
Now make a directory to contain the benchmark figures, and run the plotting macro:
```bash
mkdir output_figures/
root -l -b -q 'macros/plot_rho_physics_benchmark.C("output.root")'
```
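
The quoting in the two `root` commands is easy to get wrong, so here is a minimal Python sketch of how the same command lines could be assembled. The helper `root_macro_command` is hypothetical (it is not part of ROOT or of this tutorial's files); it only builds the argument list, and the trailing `+` on the analysis script is what asks ROOT to compile the macro before running it.

```python
import shlex

def root_macro_command(macro, *args, compile_macro=False):
    """Build the argument list for running a ROOT macro in batch mode."""
    # Pass each argument as a C++ string literal, as in the commands above.
    quoted = ",".join(f'"{a}"' for a in args)
    suffix = "+" if compile_macro else ""  # trailing "+" asks ROOT to compile the macro
    return ["root", "-l", "-b", "-q", f"{macro}{suffix}({quoted})"]

analysis_cmd = root_macro_command(
    "analysis/uchannelrho.cxx",
    "rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root",
    "output.root",
    compile_macro=True,
)
plot_cmd = root_macro_command("macros/plot_rho_physics_benchmark.C", "output.root")

# shlex.join prints the equivalent, correctly quoted shell command line.
print(shlex.join(analysis_cmd))
print(shlex.join(plot_cmd))

# To actually run a command inside eic-shell, one could use:
#   import subprocess
#   subprocess.run(analysis_cmd, check=True)
```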

You should see some errors like
~~~
Error in <TTF::SetTextSize>: error in FT_Set_Char_Size
~~~
{: .error}
but the figures should be produced just fine. If everything ran correctly, then we have a working analysis!

With this analysis as a starting point, we'll next explore using Snakemake to define an analysis workflow.

## Getting started with Snakemake

We'll now use Snakemake to define an analysis workflow, which will come in handy when building analysis pipelines. To demonstrate the advantages of Snakefiles, let's start using one for our analysis.

First, let's use Snakemake to grab some simulation campaign files from the online storage space. In your `tutorial_directory/starting_script/` directory, make a new file called `Snakefile`.
Open the file and add these lines:
```python
rule your_benchmark_campaign_reco_get:
    output:
        f"sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    shell: """
xrdcp root://dtn-eic.jlab.org//volatile/eic/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.{wildcards.INDEX}.eicrecon.tree.edm4eic.root {output}
"""
```

If you're having trouble copying and pasting, you can also copy from [here](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/Snakefile).

This defines a new rule, `your_benchmark_campaign_reco_get`, which describes how to download a single file from the JLab servers into the `sim_output` directory.

After saving the Snakefile, let's try running it.

The important thing to remember about Snakemake is that Snakemake commands behave like requests. So if I want Snakemake to produce a file called `output.root`, I would type `snakemake --cores 2 output.root`. If there is a rule for producing `output.root`, then Snakemake will find that rule and execute it. We've defined a rule to produce a file called `sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{INDEX}.eicrecon.tree.edm4eic.root`, but we can see from the construction of our rule that `{INDEX}` is a wildcard, so we should put a number there instead. Checking out the [files on S3](https://dtn01.sdcc.bnl.gov:9001/buckets/eictest/browse/RVBJQy9SRUNPLzI0LjA3LjAvZXBpY19jcmF0ZXJsYWtlL0VYQ0xVU0lWRS9VQ0hBTk5FTF9SSE8vMTB4MTAwLw==), we see files with indices from `0000` up to `0048`. Let's request that Snakemake download the file `rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root`:
```bash
snakemake --cores 2 sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root
```
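
To make the "request" idea concrete, here is a small plain-Python sketch of how a requested filename can be matched against a rule's output pattern to recover the wildcard value. The function `match_wildcards` is a hypothetical illustration, not Snakemake's implementation; it assumes a wildcard may match any non-empty text, which is the default described in the Snakemake documentation.

```python
import re

OUTPUT_PATTERN = "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{INDEX}.eicrecon.tree.edm4eic.root"

def match_wildcards(pattern, target):
    """Return the wildcard values if target matches pattern, else None."""
    regex = ""
    # Split the pattern into literal text and {NAME} placeholders.
    for part in re.split(r"(\{\w+\})", pattern):
        if re.fullmatch(r"\{\w+\}", part):
            regex += f"(?P<{part[1:-1]}>.+)"  # a wildcard matches any non-empty text
        else:
            regex += re.escape(part)          # everything else must match literally
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

requested = "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root"
print(match_wildcards(OUTPUT_PATTERN, requested))      # the rule applies, with INDEX filled in
print(match_wildcards(OUTPUT_PATTERN, "output.root"))  # None: this rule cannot produce it
```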

For this request, Snakemake looks for the rule it needs to produce the requested file. It finds the rule we wrote, and it downloads the file. Check for the file:
```bash
ls sim_output/
    rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root
```

Okay, whatever... so we downloaded a file. It doesn't look like Snakemake really adds anything at this point.

But the benefits of using Snakemake become more apparent as the number of tasks we want to do grows! Let's now add a new rule below the last one:

```python
rule your_benchmark_analysis:
    input:
        script=workflow.source_path("analysis/uchannelrho.cxx"),
        data=f"sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    output:
        plots=f"sim_output/campaign_24.07.0_{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
    shell:
        """
mkdir -p $(dirname "{output.plots}")
root -l -b -q '{input.script}+("{input.data}","{output.plots}")'
"""
```

This rule runs an analysis script to create ROOT files containing plots. The rule uses the simulation campaign file downloaded from JLab as input data, and it runs the analysis script `uchannelrho.cxx`. Note that we use `workflow.source_path()` to reference the script: this function returns the correct path to the script relative to the Snakefile's location in the benchmark directory.

Now let's request the output file `sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root`. When we request this, Snakemake will identify that it needs to run the new `your_benchmark_analysis` rule. But in order to do this, it needs a file we don't have yet, `sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0005.eicrecon.tree.edm4eic.root`, because so far we have only downloaded the file with index `0000`. Snakemake automatically recognizes that in order to get that file, it first needs to run the `your_benchmark_campaign_reco_get` rule. It will do this first, and then circle back to the `your_benchmark_analysis` rule.

Let's try it out:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root
```
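
The backward reasoning described above can be sketched in a few lines of plain Python. This is a toy resolver written only for illustration: the function `plan` and its hard-coded knowledge of our two rules are assumptions, and Snakemake's real dependency resolution is far more general.

```python
def plan(target, existing):
    """Return the rules to run, in order, to produce target (toy resolver)."""
    if target in existing:
        return []  # nothing to do, the file is already there
    if target.endswith("/plots.root"):
        # sim_output/campaign_24.07.0_<INDEX>.eicrecon.tree.edm4eic/plots.root
        index = target.split("campaign_24.07.0_")[1].split(".")[0]
        data = f"sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{index}.eicrecon.tree.edm4eic.root"
        # First make sure the input data exists, then run the analysis.
        return plan(data, existing) + [f"your_benchmark_analysis (INDEX={index})"]
    if target.startswith("sim_output/rho_"):
        index = target.split("hiDiv.")[1].split(".")[0]
        return [f"your_benchmark_campaign_reco_get (INDEX={index})"]
    raise ValueError(f"no rule produces {target}")

# Only the file with index 0000 has been downloaded so far.
existing = {"sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root"}

for step in plan("sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root", existing):
    print(step)
```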

Running the `snakemake` command should produce output like this:
![Snakemake output second rule]({{ page.root }}/fig/snakemake_output_rule2_new.png)

Check for the output file:
```bash
ls sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/
```
You should see `plots.root`.

That's still not very impressive. Snakemake gets more useful when we want to run the analysis code over a lot of files. Let's add a rule to do this:

```python
rule your_benchmark_combine:
    input:
        lambda wildcards: expand(
            f"sim_output/campaign_24.07.0_{% raw %}{{INDEX:04d}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
            INDEX=range(int(wildcards.N)),
        ),
    wildcard_constraints:
        N=r"\d+",
    output:
        f"sim_output/campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    shell:
        """
hadd {output} {input}
"""
```

On its face, this rule just merges ROOT files using the `hadd` command. But because the number of files to combine is part of the requested filename, Snakemake will realize that the per-file inputs don't exist yet, and will go back to the `your_benchmark_campaign_reco_get` and `your_benchmark_analysis` rules to create them.

Let's test it out by requesting that it combine 10 files:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots.root
```
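
For this request `N` is 10, so the input function of `your_benchmark_combine` has to produce ten paths. The sketch below reproduces that list with Python's own `str.format` instead of Snakemake's `expand`, so it is only a stand-in for what the rule is expected to generate; `combine_inputs` is a hypothetical helper name. Note how the `04d` format pads the index with zeros so that it matches the campaign file names.

```python
PATTERN = "sim_output/campaign_24.07.0_{INDEX:04d}.eicrecon.tree.edm4eic/plots.root"

def combine_inputs(n):
    """List the per-file plots.root paths needed to combine the first n files."""
    return [PATTERN.format(INDEX=i) for i in range(n)]

inputs = combine_inputs(10)
print(len(inputs))   # 10
print(inputs[0])     # sim_output/campaign_24.07.0_0000.eicrecon.tree.edm4eic/plots.root
print(inputs[-1])    # sim_output/campaign_24.07.0_0009.eicrecon.tree.edm4eic/plots.root
```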
Snakemake will spend some time downloading files and running the analysis code. Then it should `hadd` the files:
![Snakemake output third rule]({{ page.root }}/fig/snakemake_output_rule3_new.png)

Once it's done running, check that the file was produced:
```bash
ls sim_output/campaign_24.07.0_combined_10files*
```

Now let's add one more rule to create benchmark plots:
```python
rule your_benchmark_plots:
    input:
        script=workflow.source_path("macros/plot_rho_physics_benchmark.C"),
        plots=f"sim_output/campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    output:
        f"sim_output/campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf",
    shell:
        """
if [ ! -d "{input.plots}_figures" ]; then
    mkdir "{input.plots}_figures"
    echo "{input.plots}_figures directory created successfully."
else
    echo "{input.plots}_figures directory already exists."
fi
root -l -b -q '{input.script}("{input.plots}")'
"""
```

Now run the new rule by requesting a benchmark figure made from 10 simulation campaign files:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf
```

Now check that the three benchmark figures were created:
```bash
ls sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/*.pdf
```
You should see three PDFs.
We did it!

Now that our Snakefile is fully set up, we can see the big advantage of Snakemake: how it manages your workflow.
If you edit the plotting macro and then rerun:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf
```
Snakemake will recognize that the simulation campaign files have already been downloaded, that the analysis script has already run, and that the files have already been combined. It will only run the last step, the plotting macro, if that's the only thing that needs to be re-run.

If the analysis script changes, Snakemake will only re-run the analysis script and everything after it.

If we want to scale up the plots to include 15 simulation campaign files instead of just 10, Snakemake will run the download and analysis steps only for the 5 extra files, and then combine the results with those of the existing 10 files.
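
One way to picture how a step is judged to be out of date is to compare file modification times, in the spirit of `make`. The sketch below is a deliberately simplified assumption: `needs_rerun` is a hypothetical helper, and Snakemake's real criteria for re-running a job are richer than a timestamp comparison.

```python
import os
import tempfile

def needs_rerun(output, inputs):
    """True if output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_time for path in inputs)

# Demonstrate with throwaway files standing in for the macro and a figure.
with tempfile.TemporaryDirectory() as tmp:
    macro = os.path.join(tmp, "plot_macro.C")
    figure = os.path.join(tmp, "figure.pdf")
    for path in (macro, figure):
        open(path, "w").close()
    os.utime(macro, (1000, 1000))    # macro last edited at t=1000
    os.utime(figure, (2000, 2000))   # figure produced later, at t=2000
    print(needs_rerun(figure, [macro]))  # False: the figure is up to date
    os.utime(macro, (3000, 3000))    # now the macro is edited again
    print(needs_rerun(figure, [macro]))  # True: the plotting step must run again
```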
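
The final Snakefile should look like this (the download rule additionally sets `retries: 3`, which asks Snakemake to retry a failed download up to three times):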
```python
rule your_benchmark_campaign_reco_get:
    output:
        f"sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    retries: 3
    shell: """
xrdcp root://dtn-eic.jlab.org//volatile/eic/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.{wildcards.INDEX}.eicrecon.tree.edm4eic.root {output}
"""

rule your_benchmark_analysis:
    input:
        script=workflow.source_path("analysis/uchannelrho.cxx"),
        data=f"sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    output:
        plots=f"sim_output/campaign_24.07.0_{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
    shell:
        """
mkdir -p $(dirname "{output.plots}")
root -l -b -q '{input.script}+("{input.data}","{output.plots}")'
"""

rule your_benchmark_combine:
    input:
        lambda wildcards: expand(
            f"sim_output/campaign_24.07.0_{% raw %}{{INDEX:04d}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
            INDEX=range(int(wildcards.N)),
        ),
    wildcard_constraints:
        N=r"\d+",
    output:
        f"sim_output/campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    shell:
        """
hadd {output} {input}
"""

rule your_benchmark_plots:
    input:
        script=workflow.source_path("macros/plot_rho_physics_benchmark.C"),
        plots=f"sim_output/campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    output:
        f"sim_output/campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf",
    shell:
        """
if [ ! -d "{input.plots}_figures" ]; then
    mkdir "{input.plots}_figures"
    echo "{input.plots}_figures directory created successfully."
else
    echo "{input.plots}_figures directory already exists."
fi
root -l -b -q '{input.script}("{input.plots}")'
"""
```

## Conclusion

In this exercise we've built an analysis workflow using Snakemake. That required us to think about the flow of the data and come up with a file naming scheme to reflect it. This approach can be scaled between local testing with a handful of files and highly parallel analyses on full datasets.

{% include links.md %}