---
title: "Exercise 1: Analysis Scripts and Snakemake"
teaching: 20
exercises: 10
questions:
- "How does one set up data analysis workflows?"
objectives:
- "Learn the basics of creating Snakemake workflows"
keypoints:
- "Snakemake allows one to run their data analyses and share them with others"
---

In this exercise we start with a ready-made analysis script that we run locally. We'll then practice using the [Snakemake](https://snakemake.github.io/) workflow management system to define data analysis pipelines.

Snakemake comes with great [documentation](https://snakemake.readthedocs.io), which you are encouraged to read. For now, let's cover its suggested use for defining ePIC benchmarks.

## Starting from an Analysis Script

We're going to go all the way from an analysis script to a fully integrated benchmark with GitLab's continuous integration (CI).

First launch eic-shell:
```bash
./eic-shell
```
then create a working directory:
```bash
mkdir tutorial_directory
mkdir tutorial_directory/starting_script
cd tutorial_directory/starting_script
```

Copy the following files to this working directory:
- this analysis script: [`uchannelrho.cxx`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/uchannelrho.cxx)
- this plotting macro: [`plot_rho_physics_benchmark.C`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/prefinal/plot_rho_physics_benchmark.C)
- this style header: [`RiceStyle.h`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/RiceStyle.h)

We will also start by running over a file from the simulation campaign. Download it to your workspace:
```bash
xrdcp root://dtn-eic.jlab.org//work/eic2/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root ./
```

Organize files into `analysis` and `macros` directories:
```bash
mkdir analysis
mv uchannelrho.cxx analysis/
mkdir macros
mv plot_rho_physics_benchmark.C macros/
mv RiceStyle.h macros/
```

Run the analysis script over the simulation campaign output:
```bash
root -l -b -q 'analysis/uchannelrho.cxx+("rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root","output.root")'
```
Now make a directory to contain the benchmark figures, and run the plotting macro:
```bash
mkdir output_figures/
root -l -b -q 'macros/plot_rho_physics_benchmark.C("output.root")'
```

You should see some errors like
0062 ~~~
0063 Error in <TTF::SetTextSize>: error in FT_Set_Char_Size
0064 ~~~
0065 {: .error}
0066 but the figures should be produced just fine. If everything's run correctly, then we have a working analysis!
0067
0068
0069 With this analysis as a starting point, we'll next explore using Snakemake to define an analysis workflow.
0070
0071
0072 ## Getting started with Snakemake
0073
0074 We'll now use a tool called Snakemake to define an analysis workflow that will come in handy when building analysis pipelines.
0075
0076 In order to demonstrate the advantages of using snakefiles, let's start using them for our analysis.
0077 First let's use snakemake to grab some simulation campaign files from the online storage space. In your `tutorial_directory/starting_script/` directory make a new file called `Snakefile`.
0078 Open the file and add these lines:
```python
import os

# Set environment mode (local or eicweb)
ENV_MODE = os.getenv("ENV_MODE", "local")  # Defaults to "local" if not set
# Output directory based on environment
OUTPUT_DIR = "../../sim_output/" if ENV_MODE == "eicweb" else "sim_output/"
# Benchmark directory based on environment
BENCH_DIR = "benchmarks/your_benchmark/" if ENV_MODE == "eicweb" else "./"

rule your_benchmark_campaign_reco_get:
    output:
        f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    shell:
        """
        xrdcp root://dtn-eic.jlab.org//work/eic2/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.{wildcards.INDEX}.eicrecon.tree.edm4eic.root {output}
        """
```

If you're having trouble copying and pasting, you can also copy from [here](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/Snakefile).

Thinking ahead to when we want to put our benchmark on eicweb, we add this `ENV_MODE` variable which allows us to specify paths differently based on whether we're running locally or in GitLab's pipelines.
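To make the effect of `ENV_MODE` concrete, here is a small plain-Python sketch of the same path selection. The helper name `resolve_paths` is made up for this illustration and is not part of the Snakefile; the path strings are the ones used above.

```python
import os

def resolve_paths(env_mode):
    """Return (output_dir, bench_dir) using the same conditions as the Snakefile."""
    # resolve_paths is a hypothetical helper, used only for this illustration
    output_dir = "../../sim_output/" if env_mode == "eicweb" else "sim_output/"
    bench_dir = "benchmarks/your_benchmark/" if env_mode == "eicweb" else "./"
    return output_dir, bench_dir

# Same default as the Snakefile: "local" when ENV_MODE is not set
current_mode = os.getenv("ENV_MODE", "local")
print(current_mode, resolve_paths(current_mode))
print("eicweb", resolve_paths("eicweb"))
```

Any value other than `eicweb` falls through to the local paths, so running without the variable set keeps everything inside your working directory.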

We also defined a new rule: `your_benchmark_campaign_reco_get`. This rule defines how to download a single file from the JLab servers into the `sim_output/` directory.

After saving the Snakefile, let's try running it.

The important thing to remember about Snakemake is that Snakemake commands behave like requests. So if I want Snakemake to produce a file called `output.root`, I would type `snakemake --cores 2 output.root`. If there is a rule for producing `output.root`, then Snakemake will find that rule and execute it. When running locally, the rule we've defined produces a file called `sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{INDEX}.eicrecon.tree.edm4eic.root`. Here `{INDEX}` is a wildcard, so in our request we should put a number there instead. Checking out the [files on S3](https://dtn01.sdcc.bnl.gov:9001/buckets/eictest/browse/RVBJQy9SRUNPLzI0LjA3LjAvZXBpY19jcmF0ZXJsYWtlL0VYQ0xVU0lWRS9VQ0hBTk5FTF9SSE8vMTB4MTAwLw==), we see files with indices from `0000` up to `0048`. Let's request that Snakemake download the file `rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root`:
```bash
snakemake --cores 2 sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root
```

Snakemake now looks for the rule it needs to produce that file. It finds the rule we wrote, and it downloads the file. Check for the file:
```bash
ls sim_output/
rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root
```
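How does a requested file name turn into a value for `INDEX`? The sketch below imitates the idea with a regular expression. It is a simplified illustration and not Snakemake's actual implementation: the function `match_wildcards` is invented for this example, and it assumes that a wildcard may match any non-empty string, which is Snakemake's default unless a constraint is given.

```python
import re

# Output pattern of the download rule as it looks locally (OUTPUT_DIR = "sim_output/")
PATTERN = "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{INDEX}.eicrecon.tree.edm4eic.root"

def match_wildcards(pattern, target):
    """Return a dict of wildcard values if target fits pattern, otherwise None."""
    regex = ""
    # Splitting on a capture group alternates literal text and wildcard names
    for i, part in enumerate(re.split(r"\{(\w+)\}", pattern)):
        if i % 2 == 0:
            regex += re.escape(part)        # literal piece of the file name
        else:
            regex += f"(?P<{part}>.+)"      # wildcard: any non-empty string
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None

requested = "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root"
print(match_wildcards(PATTERN, requested))
print(match_wildcards(PATTERN, "output.root"))
```

The first request fits the pattern with `INDEX` set to `0000`, which is the value the rule then substitutes for `{wildcards.INDEX}` in its shell command. The second request fits no pattern, so no rule could be chosen for it.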

Okay, so we downloaded a file. It doesn't look like Snakemake really adds anything at this point.

But the benefits of using Snakemake become more apparent as the number of tasks we want to do grows! Let's now add a new rule below the last one:

```python
rule your_benchmark_analysis:
    input:
        script=f"{BENCH_DIR}analysis/uchannelrho.cxx",
        data=f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    output:
        plots=f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
    shell:
        """
        mkdir -p $(dirname "{output.plots}")
        root -l -b -q '{input.script}+("{input.data}","{output.plots}")'
        """
```

This rule runs an analysis script to create ROOT files containing plots. The rule uses the simulation campaign file downloaded from JLab as input data, and it runs the analysis script `uchannelrho.cxx`.

Now let's request the output file `"sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root"`. When we request this, Snakemake will identify that it needs to run the new `your_benchmark_analysis` rule. But in order to do this, it needs a file we don't have yet: `sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0005.eicrecon.tree.edm4eic.root`, because so far we have only downloaded the file with index `0000`. Snakemake will automatically recognize that in order to get that file, it first needs to run the `your_benchmark_campaign_reco_get` rule. It will do this first, and then circle back to the `your_benchmark_analysis` rule.

Let's try it out:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root
```

You should see something like this:
![Snakemake output second rule]({{ page.root }}/fig/snakemake_output_rule2_new.png)
Check for the output file:
```bash
ls sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/
```
You should see `plots.root`.
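The way Snakemake worked backwards from the requested file to the download rule can be pictured with a toy planner. This is only a sketch of the idea under simplifying assumptions: the file names are shortened stand-ins, the `RULES` table and the `plan` function are invented for this illustration, and real Snakemake resolves rules through wildcard patterns rather than a fixed lookup table.

```python
# Toy model: each producible file maps to the rule that makes it and the inputs it needs.
# The file names are shortened stand-ins for the real paths.
RULES = {
    "plots_0005.root": {
        "rule": "your_benchmark_analysis",
        "inputs": ["analysis/uchannelrho.cxx", "reco_0005.root"],
    },
    "reco_0005.root": {
        "rule": "your_benchmark_campaign_reco_get",
        "inputs": [],
    },
}

def plan(target, existing, order=None):
    """Return the rule names in the order they must run to produce target."""
    if order is None:
        order = []
    if target in existing:
        return order                    # already on disk, nothing to do
    entry = RULES[target]               # KeyError here means no rule can make the file
    for dependency in entry["inputs"]:
        plan(dependency, existing, order)   # make sure the inputs exist first
    order.append(entry["rule"])
    return order

on_disk = {"analysis/uchannelrho.cxx", "reco_0000.root"}
print(plan("plots_0005.root", on_disk))
```

With only the `0000` file on disk, the plan lists the download rule before the analysis rule, which is the order in which the two jobs ran above.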

That's still not very impressive. Snakemake gets more useful when we want to run the analysis code over a lot of files. Let's add a rule to do this:

```python
rule your_benchmark_combine:
    input:
        lambda wildcards: expand(
            f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX:04d}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
            INDEX=range(int(wildcards.N)),
        ),
    wildcard_constraints:
        # Raw string so that \d is passed to the regex engine unchanged
        N=r"\d+",
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    shell:
        """
        hadd {output} {input}
        """
```

On its face, this rule just adds ROOT files using the `hadd` command. But when you specify the number of files you want to combine, Snakemake will notice that the per-file outputs don't exist yet, and will go back to the `your_benchmark_campaign_reco_get` and `your_benchmark_analysis` rules to create them.

Let's test it out by requesting it combine 10 files:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots.root
```
It will spend some time downloading files and running the analysis code. Then it should hadd the files:
![Snakemake output third rule]({{ page.root }}/fig/snakemake_output_rule3_new.png)

Once it's done running, check that the file was produced:
```bash
ls sim_output/campaign_24.07.0_combined_10files*
```
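To see which inputs the combine rule asks for, here is a plain-Python equivalent of its input function. It is a sketch that assumes the local value of `OUTPUT_DIR`; the helper name `combine_inputs` is made up for this illustration.

```python
OUTPUT_DIR = "sim_output/"  # local value from the Snakefile

def combine_inputs(n):
    """List the per-file plots.root paths needed to combine n files."""
    # combine_inputs is a hypothetical helper; :04d zero-pads the index to four digits
    return [
        f"{OUTPUT_DIR}campaign_24.07.0_{index:04d}.eicrecon.tree.edm4eic/plots.root"
        for index in range(n)
    ]

for path in combine_inputs(3):
    print(path)
```

Requesting `N=10` therefore means indices `0000` through `0009`. Each of those paths matches the output of `your_benchmark_analysis`, which is how the missing files get scheduled.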

Now let's add one more rule to create benchmark plots:
```python
rule your_benchmark_plots:
    input:
        script=f"{BENCH_DIR}macros/plot_rho_physics_benchmark.C",
        plots=f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf",
    shell:
        """
        if [ ! -d "{input.plots}_figures" ]; then
            mkdir "{input.plots}_figures"
            echo "{input.plots}_figures directory created successfully."
        else
            echo "{input.plots}_figures directory already exists."
        fi
        root -l -b -q '{input.script}("{input.plots}")'
        """
```

Now run the new rule by requesting a benchmark figure made from 10 simulation campaign files:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf
```

Now check that the three benchmark figures were created:
```bash
ls sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/*.pdf
```
You should see three PDFs.
We did it!

Now that our Snakefile is fully set up, we can see the big advantage of Snakemake: how it manages your workflow.
If you edit the plotting macro and then rerun:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf
```
Snakemake will recognize that the simulation campaign files have already been downloaded, that the analysis script has already run, and that the files have already been combined. It will only run the last step, the plotting macro, since that's the only thing that needs to be re-run.

If the analysis script changes, Snakemake will re-run only the analysis script and everything after it.

If we want to scale up the plots to include 15 simulation campaign files instead of just 10, Snakemake will run the download and analysis steps only for the 5 extra files, and then combine their output with that of the existing 10 files.
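One way to picture how a workflow tool decides what is out of date is to compare file modification times. The sketch below shows only that timestamp idea, with throwaway files and a made-up function `needs_rerun`. Snakemake's actual criteria are richer than this and depend on its version and settings, so treat it as an illustration rather than a description of its internals.

```python
import os
import tempfile

def needs_rerun(output, inputs):
    """True if output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_mtime for path in inputs)

# Demonstration with throwaway files and explicit timestamps
with tempfile.TemporaryDirectory() as tmp:
    macro = os.path.join(tmp, "macro.C")
    figure = os.path.join(tmp, "figure.pdf")
    for path in (macro, figure):
        open(path, "w").close()
    os.utime(macro, (1000, 1000))    # macro last edited at t=1000
    os.utime(figure, (2000, 2000))   # figure produced later, at t=2000
    fresh = needs_rerun(figure, [macro])
    os.utime(macro, (3000, 3000))    # macro edited again, after the figure was made
    stale = needs_rerun(figure, [macro])

print(fresh, stale)
```

Before the second edit the figure is newer than the macro, so nothing needs to run. After the edit the figure is older than its input, which marks the plotting step, and only that step, for a re-run.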

The final Snakefile should look like this:
```python
import os

# Set environment mode (local or eicweb)
ENV_MODE = os.getenv("ENV_MODE", "local")  # Defaults to "local" if not set
# Output directory based on environment
OUTPUT_DIR = "../../sim_output/" if ENV_MODE == "eicweb" else "sim_output/"
# Benchmark directory based on environment
BENCH_DIR = "benchmarks/your_benchmark/" if ENV_MODE == "eicweb" else "./"

rule your_benchmark_campaign_reco_get:
    output:
        f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    shell:
        """
        xrdcp root://dtn-eic.jlab.org//work/eic2/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.{wildcards.INDEX}.eicrecon.tree.edm4eic.root {output}
        """

rule your_benchmark_analysis:
    input:
        script=f"{BENCH_DIR}analysis/uchannelrho.cxx",
        data=f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    output:
        plots=f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
    shell:
        """
        mkdir -p $(dirname "{output.plots}")
        root -l -b -q '{input.script}+("{input.data}","{output.plots}")'
        """

rule your_benchmark_combine:
    input:
        lambda wildcards: expand(
            f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX:04d}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
            INDEX=range(int(wildcards.N)),
        ),
    wildcard_constraints:
        # Raw string so that \d is passed to the regex engine unchanged
        N=r"\d+",
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    shell:
        """
        hadd {output} {input}
        """

rule your_benchmark_plots:
    input:
        script=f"{BENCH_DIR}macros/plot_rho_physics_benchmark.C",
        plots=f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf",
    shell:
        """
        if [ ! -d "{input.plots}_figures" ]; then
            mkdir "{input.plots}_figures"
            echo "{input.plots}_figures directory created successfully."
        else
            echo "{input.plots}_figures directory already exists."
        fi
        root -l -b -q '{input.script}("{input.plots}")'
        """
```

## Conclusion

In this exercise we've built an analysis workflow using Snakemake. That required us to think about the flow of the data and come up with a file naming scheme to reflect it. This approach scales from local testing with a handful of files to highly parallel analyses on full datasets.

{% include links.md %}