---
title: "Exercise 1: Analysis Scripts and Snakemake"
teaching: 20
exercises: 10
questions:
- "How does one set up data analysis workflows?"
objectives:
- "Learn basics of creating Snakemake workflows"
keypoints:
- "Snakemake allows one to run their data analyses and share them with others"
---

In this exercise we start with a ready-made analysis script that we run locally. We'll practice using the [Snakemake](https://snakemake.github.io/) workflow management system to define data analysis pipelines.

Snakemake comes with great [documentation](https://snakemake.readthedocs.io), and you are encouraged to read it. For now, let's cover its suggested use for defining ePIC benchmarks.

## Starting from an Analysis Script

We're going to go all the way from an analysis script to a fully integrated benchmark running in GitLab's continuous integration (CI).

First launch eic-shell:
```bash
./eic-shell
```
then create a working directory:
```bash
mkdir tutorial_directory
mkdir tutorial_directory/starting_script
cd tutorial_directory/starting_script
```

Copy the following files to this working directory:
- this analysis script: [`uchannelrho.cxx`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/uchannelrho.cxx)
- this plotting macro: [`plot_rho_physics_benchmark.C`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/prefinal/plot_rho_physics_benchmark.C)
- this style header: [`RiceStyle.h`](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/RiceStyle.h)

We will also start by running over a file from the simulation campaign. Download it to your workspace:
```bash
xrdcp root://dtn-eic.jlab.org//work/eic2/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root ./
```

Organize files into `analysis` and `macros` directories:
```bash
mkdir analysis
mv uchannelrho.cxx analysis/
mkdir macros
mv plot_rho_physics_benchmark.C macros/
mv RiceStyle.h macros/
```

Run the analysis script over the simulation campaign output:
```bash
root -l -b -q 'analysis/uchannelrho.cxx+("rho_10x100_uChannel_Q2of0to10_hiDiv.0020.eicrecon.tree.edm4eic.root","output.root")'
```
Now make a directory to contain the benchmark figures, and run the plotting macro:
```bash
mkdir output_figures/
root -l -b -q 'macros/plot_rho_physics_benchmark.C("output.root")'
```

You should see some errors like
~~~
Error in <TTF::SetTextSize>: error in FT_Set_Char_Size
~~~
{: .error}
but the figures should be produced just fine. If everything ran correctly, we have a working analysis!

With this analysis as a starting point, we'll next explore using Snakemake to define an analysis workflow.

## Getting started with Snakemake

We'll now use Snakemake to define an analysis workflow, which will come in handy when building analysis pipelines.

To demonstrate the advantages of Snakefiles, let's start using them for our analysis.
First let's use Snakemake to grab some simulation campaign files from the online storage space. In your `tutorial_directory/starting_script/` directory make a new file called `Snakefile`.
Open the file and add these lines:
```python
import os

# Set environment mode (local or eicweb)
ENV_MODE = os.getenv("ENV_MODE", "local")  # Defaults to "local" if not set
# Output directory based on environment
OUTPUT_DIR = "../../sim_output/" if ENV_MODE == "eicweb" else "sim_output/"
# Benchmark directory based on environment
BENCH_DIR = "benchmarks/your_benchmark/" if ENV_MODE == "eicweb" else "./"

rule your_benchmark_campaign_reco_get:
    output:
        f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    shell: """
xrdcp root://dtn-eic.jlab.org//work/eic2/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.{wildcards.INDEX}.eicrecon.tree.edm4eic.root {output}
"""
```

If you're having trouble copying and pasting, you can also copy from [here](https://github.com/eic/tutorial-developing-benchmarks/blob/gh-pages/files/Snakefile).

Thinking ahead to when we want to put our benchmark on eicweb, we add the `ENV_MODE` variable, which lets us specify paths differently depending on whether we're running locally or in GitLab's pipelines.

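To make the effect of `ENV_MODE` concrete, the following plain-Python sketch prints the two directories for each mode. The helper `resolve_dirs` is hypothetical and exists only for this illustration; it mirrors the two conditional assignments in the Snakefile, where any value other than `eicweb` falls back to the local paths.

```python
import os


def resolve_dirs(env_mode):
    """Return (OUTPUT_DIR, BENCH_DIR) for a given mode, mirroring the Snakefile.

    Hypothetical helper for illustration only; the Snakefile itself uses
    module-level variables rather than a function.
    """
    output_dir = "../../sim_output/" if env_mode == "eicweb" else "sim_output/"
    bench_dir = "benchmarks/your_benchmark/" if env_mode == "eicweb" else "./"
    return output_dir, bench_dir


# Same default as in the Snakefile: "local" unless ENV_MODE is set.
current_mode = os.getenv("ENV_MODE", "local")
print(f"current mode: {current_mode}")

for mode in ("local", "eicweb"):
    output_dir, bench_dir = resolve_dirs(mode)
    print(f"{mode}: OUTPUT_DIR={output_dir} BENCH_DIR={bench_dir}")
```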
We also defined a new rule: `your_benchmark_campaign_reco_get`. This rule defines how to download a single file from the JLab servers into the `sim_output/` directory.

After saving the Snakefile, let's try running it.

The important thing to remember about Snakemake is that its commands behave like requests. If I want Snakemake to produce a file called `output.root`, I type `snakemake --cores 2 output.root`. If there is a rule for producing `output.root`, Snakemake will find that rule and execute it. We've defined a rule to produce a file called `sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{INDEX}.eicrecon.tree.edm4eic.root` (when running locally; on eicweb the prefix becomes `../../sim_output/`). From the construction of our rule we can see that `{INDEX}` is a wildcard, so we should put a number there instead. Checking out the [files on S3](https://dtn01.sdcc.bnl.gov:9001/buckets/eictest/browse/RVBJQy9SRUNPLzI0LjA3LjAvZXBpY19jcmF0ZXJsYWtlL0VYQ0xVU0lWRS9VQ0hBTk5FTF9SSE8vMTB4MTAwLw==), we see files with indices from `0000` up to `0048`. Let's request that Snakemake download the file `rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root`:
```bash
snakemake --cores 2 sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root
```

Snakemake now looks for the rule it needs to produce that file. It finds the rule we wrote, and it downloads the file. Check for the file:
```bash
ls sim_output/
    rho_10x100_uChannel_Q2of0to10_hiDiv.0000.eicrecon.tree.edm4eic.root
```

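How does Snakemake decide that our rule can produce the requested file? Roughly speaking, it matches the requested path against each rule's output pattern and reads the wildcard values off the match. Below is a simplified sketch of that idea in plain Python, not Snakemake's actual implementation. The helper `match_target` is hypothetical, and it assumes that each wildcard appears only once in the pattern and may match any non-empty string (which, to our understanding, is Snakemake's default when no constraint is given).

```python
import re

OUTPUT_PATTERN = (
    "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv."
    "{INDEX}.eicrecon.tree.edm4eic.root"
)


def match_target(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, otherwise None."""
    # Splitting on "{name}" alternates literal text (even positions)
    # with wildcard names (odd positions).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if position % 2 == 0 else f"(?P<{part}>.+)"
        for position, part in enumerate(parts)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


requested = (
    "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv."
    "0000.eicrecon.tree.edm4eic.root"
)
print(match_target(OUTPUT_PATTERN, requested))      # wildcard values for our request
print(match_target(OUTPUT_PATTERN, "output.root"))  # no rule matches this path
```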
Okay, so we downloaded a file. It doesn't look like Snakemake really adds anything at this point.

But the benefits of using Snakemake become more apparent as the number of tasks we want to do grows! Let's now add a new rule below the last one:

```python
rule your_benchmark_analysis:
    input:
        script=f"{BENCH_DIR}analysis/uchannelrho.cxx",
        data=f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    output:
        plots=f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
    shell:
        """
mkdir -p $(dirname "{output.plots}")
root -l -b -q '{input.script}+("{input.data}","{output.plots}")'
"""
```

This rule runs an analysis script to create ROOT files containing plots. The rule uses the simulation campaign file downloaded from JLab as input data, and it runs the analysis script `uchannelrho.cxx`.

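It can help to see what the `shell` template turns into once the placeholders are filled in. The sketch below imitates that substitution with Python's built-in `str.format` and `SimpleNamespace` objects standing in for Snakemake's `input` and `output`; this is an approximation for illustration, not how Snakemake is implemented internally. The paths assume a local run (`BENCH_DIR` is `./`, `OUTPUT_DIR` is `sim_output/`) and index `0005`.

```python
from types import SimpleNamespace

# The ROOT line from the rule's shell block, with the same placeholders.
SHELL_TEMPLATE = """root -l -b -q '{input.script}+("{input.data}","{output.plots}")'"""

# Stand-ins for the rule's named inputs and outputs (local paths, INDEX=0005).
rule_input = SimpleNamespace(
    script="./analysis/uchannelrho.cxx",
    data="sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0005.eicrecon.tree.edm4eic.root",
)
rule_output = SimpleNamespace(
    plots="sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root",
)

# str.format resolves "{input.script}" by attribute lookup on the objects passed in.
command = SHELL_TEMPLATE.format(input=rule_input, output=rule_output)
print(command)
```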
Now let's request the output file `sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root`. When we request this, Snakemake will identify that it needs to run the new `your_benchmark_analysis` rule. But in order to do this, it needs a file we don't have yet: `sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.0005.eicrecon.tree.edm4eic.root`, because so far we have only downloaded the file with index `0000`. Snakemake will automatically recognize that in order to get that file, it first needs to run the `your_benchmark_campaign_reco_get` rule. It will do this first, and then circle back to the `your_benchmark_analysis` rule.

Let's try it out:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/plots.root
```

You should see something like this:
![Snakemake output second rule]({{ page.root }}/fig/snakemake_output_rule2_new.png)
Check for the output file:
```bash
ls sim_output/campaign_24.07.0_0005.eicrecon.tree.edm4eic/
```
You should see `plots.root`.

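The "work backwards from the requested file" behaviour described above can be sketched as a small recursive planner. This is a toy model under stated assumptions, not Snakemake's scheduler: the `RULES` table, `match` and `plan` are hypothetical names, files are tracked as a plain set of paths instead of the real filesystem, and wildcards are assumed to match any non-empty string.

```python
import re

RECO = "sim_output/rho_10x100_uChannel_Q2of0to10_hiDiv.{INDEX}.eicrecon.tree.edm4eic.root"
PLOTS = "sim_output/campaign_24.07.0_{INDEX}.eicrecon.tree.edm4eic/plots.root"
SCRIPT = "./analysis/uchannelrho.cxx"

# Each toy rule has an output pattern and a function from wildcards to inputs.
RULES = [
    {"name": "your_benchmark_campaign_reco_get", "output": RECO,
     "inputs": lambda wildcards: []},
    {"name": "your_benchmark_analysis", "output": PLOTS,
     "inputs": lambda wildcards: [SCRIPT, RECO.format(**wildcards)]},
]


def match(pattern, target):
    """Return wildcard values if `target` fits `pattern`, otherwise None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if position % 2 == 0 else f"(?P<{part}>.+)"
        for position, part in enumerate(parts)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


def plan(target, existing, jobs=None):
    """Return the ordered list of (rule name, file) jobs needed for `target`."""
    jobs = [] if jobs is None else jobs
    if target in existing:
        return jobs  # already on disk, nothing to do
    for rule in RULES:
        wildcards = match(rule["output"], target)
        if wildcards is None:
            continue
        # Make sure every input exists (or is planned) before this job.
        for dependency in rule["inputs"](wildcards):
            plan(dependency, existing, jobs)
        if (rule["name"], target) not in jobs:
            jobs.append((rule["name"], target))
        return jobs
    raise FileNotFoundError(f"no rule produces {target}")


# We already have the analysis script and the file with index 0000.
existing_files = {SCRIPT, RECO.format(INDEX="0000")}
for rule_name, produced in plan(PLOTS.format(INDEX="0005"), existing_files):
    print(rule_name, "->", produced)
```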
That's still not very impressive. Snakemake gets more useful when we want to run the analysis code over a lot of files. Let's add a rule to do this:

```python
rule your_benchmark_combine:
    input:
        lambda wildcards: expand(
           f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX:04d}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
           INDEX=range(int(wildcards.N)),
        ),
    wildcard_constraints:
        N=r"\d+",  # raw string, so that \d is not treated as a string escape
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    shell:
        """
hadd {output} {input}
"""
```

On its face, this rule just adds ROOT files using the `hadd` command. But when you specify the number of files you want to add, Snakemake will realize those files don't exist yet, and will go back to the `your_benchmark_campaign_reco_get` and `your_benchmark_analysis` rules to create them.

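To see which input files the rule asks for, here is a plain-Python imitation of the `lambda` and `expand` call that uses only `str.format`. The helper `combine_inputs` is hypothetical, and the sketch assumes a local run where `OUTPUT_DIR` is `sim_output/`. The `04d` format specification zero-pads each index to four digits, and the `\d+` check plays the role of the wildcard constraint on `N`.

```python
import re

OUTPUT_DIR = "sim_output/"  # local value from the Snakefile
PATTERN = OUTPUT_DIR + "campaign_24.07.0_{INDEX:04d}.eicrecon.tree.edm4eic/plots.root"


def combine_inputs(n_text):
    """List the per-file plots.root paths needed to combine `n_text` files."""
    # Same idea as wildcard_constraints: N must consist of digits only.
    if re.fullmatch(r"\d+", n_text) is None:
        raise ValueError(f"N must be a non-negative integer, got {n_text!r}")
    return [PATTERN.format(INDEX=index) for index in range(int(n_text))]


for path in combine_inputs("3"):
    print(path)
```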
Let's test it out by requesting that it combine 10 files:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots.root
```
It will spend some time downloading files and running the analysis code. Then it should `hadd` the files:
![Snakemake output third rule]({{ page.root }}/fig/snakemake_output_rule3_new.png)

Once it's done running, check that the file was produced:
```bash
ls sim_output/campaign_24.07.0_combined_10files*
```

Now let's add one more rule to create benchmark plots:
```python
rule your_benchmark_plots:
    input:
        script=f"{BENCH_DIR}macros/plot_rho_physics_benchmark.C",
        plots=f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf",
    shell:
        """
if [ ! -d "{input.plots}_figures" ]; then
    mkdir "{input.plots}_figures"
    echo "{input.plots}_figures directory created successfully."
else
    echo "{input.plots}_figures directory already exists."
fi
root -l -b -q '{input.script}("{input.plots}")'
"""
```

Now run the new rule by requesting a benchmark figure made from 10 simulation campaign files:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf
```

Now check that the three benchmark figures were created:
```bash
ls sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/*.pdf
```
You should see three PDFs.
We did it!

Now that our Snakefile is fully set up, the big advantage of Snakemake is how it manages your workflow.
If you edit the plotting macro and then rerun:
```bash
snakemake --cores 2 sim_output/campaign_24.07.0_combined_10files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf
```
Snakemake will recognize that the simulation campaign files have already been downloaded, that the analysis script has already run, and that the files have already been combined. Since the plotting macro is the only thing that changed, it will re-run only the last step.

If the analysis script changes, Snakemake will only re-run the analysis script and everything after it.

If we want to scale up the plots to include 15 simulation campaign files instead of just 10, Snakemake will run the download and analysis steps only for the 5 extra files, and then combine the results with those of the existing 10 files.

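The core of this "only redo what is out of date" logic can be illustrated with file modification times. The sketch below is a deliberately reduced model: the helper `needs_rerun` is hypothetical and looks at timestamps only, whereas Snakemake's actual criteria are richer (to our understanding, recent versions can also react to changes in a rule's code or parameters). It uses empty placeholder files in a temporary directory and sets their timestamps explicitly.

```python
import os
import tempfile


def needs_rerun(output, inputs):
    """True if `output` is missing or older than any of its `inputs`."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_mtime for path in inputs)


with tempfile.TemporaryDirectory() as workdir:
    macro = os.path.join(workdir, "plot_rho_physics_benchmark.C")
    figure = os.path.join(workdir, "benchmark_rho_mass.pdf")

    with open(macro, "w"):
        pass  # empty placeholder for the plotting macro
    missing = needs_rerun(figure, [macro])  # figure has not been made yet

    with open(figure, "w"):
        pass  # empty placeholder for the figure
    os.utime(macro, (1000, 1000))   # macro last edited at t=1000
    os.utime(figure, (2000, 2000))  # figure produced later, at t=2000
    fresh = needs_rerun(figure, [macro])

    os.utime(macro, (3000, 3000))   # macro edited again after the figure
    stale = needs_rerun(figure, [macro])

print("missing figure needs run:", missing)
print("up-to-date figure needs run:", fresh)
print("figure older than macro needs run:", stale)
```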
The final Snakefile should look like this:
```python
import os

# Set environment mode (local or eicweb)
ENV_MODE = os.getenv("ENV_MODE", "local")  # Defaults to "local" if not set
# Output directory based on environment
OUTPUT_DIR = "../../sim_output/" if ENV_MODE == "eicweb" else "sim_output/"
# Benchmark directory based on environment
BENCH_DIR = "benchmarks/your_benchmark/" if ENV_MODE == "eicweb" else "./"

rule your_benchmark_campaign_reco_get:
    output:
        f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    shell: """
xrdcp root://dtn-eic.jlab.org//work/eic2/EPIC/RECO/24.07.0/epic_craterlake/EXCLUSIVE/UCHANNEL_RHO/10x100/rho_10x100_uChannel_Q2of0to10_hiDiv.{wildcards.INDEX}.eicrecon.tree.edm4eic.root {output}
"""

rule your_benchmark_analysis:
    input:
        script=f"{BENCH_DIR}analysis/uchannelrho.cxx",
        data=f"{OUTPUT_DIR}rho_10x100_uChannel_Q2of0to10_hiDiv.{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic.root",
    output:
        plots=f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
    shell:
        """
mkdir -p $(dirname "{output.plots}")
root -l -b -q '{input.script}+("{input.data}","{output.plots}")'
"""

rule your_benchmark_combine:
    input:
        lambda wildcards: expand(
           f"{OUTPUT_DIR}campaign_24.07.0_{% raw %}{{INDEX:04d}}{% endraw %}.eicrecon.tree.edm4eic/plots.root",
           INDEX=range(int(wildcards.N)),
        ),
    wildcard_constraints:
        N=r"\d+",  # raw string, so that \d is not treated as a string escape
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    shell:
        """
hadd {output} {input}
"""

rule your_benchmark_plots:
    input:
        script=f"{BENCH_DIR}macros/plot_rho_physics_benchmark.C",
        plots=f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots.root",
    output:
        f"{OUTPUT_DIR}campaign_24.07.0_combined_{% raw %}{{N}}{% endraw %}files.eicrecon.tree.edm4eic.plots_figures/benchmark_rho_mass.pdf",
    shell:
        """
if [ ! -d "{input.plots}_figures" ]; then
    mkdir "{input.plots}_figures"
    echo "{input.plots}_figures directory created successfully."
else
    echo "{input.plots}_figures directory already exists."
fi
root -l -b -q '{input.script}("{input.plots}")'
"""
```

## Conclusion

In this exercise we've built an analysis workflow using Snakemake. That required us to think about the flow of the data and come up with a file naming scheme to reflect it. This approach scales from local testing with a handful of files to highly parallel analyses on full datasets.

{% include links.md %}