Automating analyses with make
Automated analyses?
- What if analysis depends on many files?
- Need to redo analysis with new data?
- What if analysis has several steps in a particular order?
Build manager: make
Tools like “Make” were developed to help compile complex software, but can also be used to automate any workflow.
How does Make work?
- Each time the operating system creates, reads, or changes a file, it updates a timestamp on the file. Make compares these timestamps.
- The user describes which files depend on each other by writing rules in a Makefile.
- Rules tell Make how to update an out-of-date file.
- When run, Make checks the rules and runs the commands needed to update any targets that are out of date. If there are transitive dependencies, Make traces through them and runs the rules in the right order.
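To make the last point concrete, here is a minimal Python sketch of how prerequisites can be traced so that rules run in the right order. It is an illustration only, not Make's actual implementation: the function name build_order, the rules dictionary, and the report.txt target are assumptions introduced for this example, and the sketch ignores timestamps entirely.

```python
def build_order(rules, target, seen=None):
    """Return the targets that have rules, prerequisites first (depth-first)."""
    if seen is None:
        seen = set()
    order = []
    if target in seen:
        return order  # already scheduled, do not visit twice
    seen.add(target)
    for prerequisite in rules.get(target, []):
        order.extend(build_order(rules, prerequisite, seen))
    if target in rules:
        # Only files that have a rule get built; plain source files are leaves.
        order.append(target)
    return order


# Hypothetical rules: report.txt needs the results, which need the data and the script.
rules = {
    "report.txt": ["results/moby_dick.csv"],
    "results/moby_dick.csv": ["data/moby_dick.txt", "src/countwords.py"],
}
print(build_order(rules, "report.txt"))
# prints: ['results/moby_dick.csv', 'report.txt']
```

Files without a rule of their own (the data file and the script) are treated as sources, so they never appear in the build order.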
Update single file
Makefile
# regenerate results
results/moby_dick.csv : data/moby_dick.txt
	python src/countwords.py \
	    data/moby_dick.txt > results/moby_dick.csv
- # indicates a comment.
- The remaining lines form a build rule, in the format target : prerequisite, followed by the recipe.
- A backslash (\) continues a command on the next line.
- The recipe consists of one or more shell commands, each prefixed by a single tab character (not spaces).
Run using command make
What happens?
- If results/moby_dick.csv doesn't exist, Make runs the recipe to create it.
- If data/moby_dick.txt is newer than results/moby_dick.csv, Make runs the recipe to update it.
- If results/moby_dick.csv is newer than its prerequisite, nothing happens.
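The three cases above boil down to one comparison of modification times. A rough Python sketch of that decision follows; the function name needs_update is hypothetical, and real Make handles more situations (for example phony targets and missing prerequisites) than this does.

```python
import os


def needs_update(target, prerequisites):
    """Decide whether a recipe should run, based on file modification times."""
    if not os.path.exists(target):
        return True  # case 1: the target does not exist yet
    target_time = os.path.getmtime(target)
    # case 2: any prerequisite newer than the target makes it out of date;
    # case 3: otherwise nothing needs to happen.
    return any(os.path.getmtime(p) > target_time for p in prerequisites)
```

For the rule above, the call would be needs_update("results/moby_dick.csv", ["data/moby_dick.txt"]).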
Managing multiple files
Makefile
# regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt
	python src/countwords.py data/moby_dick.txt > results/moby_dick.csv
# regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt
	python src/countwords.py data/jane_eyre.txt > results/jane_eyre.csv
What happens?
By default, Make only attempts to update the first target (default target)
Could specify target directly: make results/jane_eyre.csv
Better, create “phony target” and place at top: all
# regenerate all results
all : results/moby_dick.csv results/jane_eyre.csv
...
Then type make all
Other phony target: clean
By convention a clean target provides rules to remove results/generated outputs
# remove all generated files
clean :
	rm -f results/*.csv
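Then type make clean. Safer than typing the rm command by hand!
Problem if a file or directory is named clean: Make sees that it exists and has no prerequisites, so it treats the target as up to date and never runs the recipe. Avoid this by explicitly declaring phony targets at the top of the file: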
.PHONY : all clean
Add programs to prerequisites
The results also depend on the programs used to generate them, so add to prerequisites:
# regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt src/countwords.py
	python src/countwords.py data/moby_dick.txt > results/moby_dick.csv
# regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt src/countwords.py
	python src/countwords.py data/jane_eyre.txt > results/jane_eyre.csv
Reducing repetition: variables
Makefile
.PHONY : all clean
COUNT=src/countwords.py
RUN_COUNT=python $(COUNT)
# regenerate all results
all : results/moby_dick.csv results/jane_eyre.csv
# regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) data/moby_dick.txt > results/moby_dick.csv
# regenerate results for "Jane Eyre"
results/jane_eyre.csv : data/jane_eyre.txt $(COUNT)
	$(RUN_COUNT) data/jane_eyre.txt > results/jane_eyre.csv
# remove all generated files
clean :
	rm -f results/*.csv
Automatic variables
Automatic variable for target of the rule: $@
# regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) data/moby_dick.txt > $@
The first prerequisite of the rule: $<
# regenerate results for "Moby Dick"
results/moby_dick.csv : data/moby_dick.txt $(COUNT)
	$(RUN_COUNT) $< > $@
Also: all prerequisites of the rule: $^
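As a rough illustration of what these three automatic variables stand for, the Python sketch below substitutes them into a recipe string. The function name expand_automatic is an assumption made for this example, and plain string replacement is a simplification of how Make really expands variables.

```python
def expand_automatic(recipe, target, prerequisites):
    """Substitute $@, $< and $^ in a recipe string (simplified)."""
    first = prerequisites[0] if prerequisites else ""
    return (
        recipe.replace("$@", target)                    # target of the rule
        .replace("$<", first)                           # first prerequisite
        .replace("$^", " ".join(prerequisites))         # all prerequisites
    )


recipe = "python src/countwords.py $< > $@"
print(expand_automatic(
    recipe,
    "results/moby_dick.csv",
    ["data/moby_dick.txt", "src/countwords.py"],
))
# prints: python src/countwords.py data/moby_dick.txt > results/moby_dick.csv
```

Note that $< picks up only the data file here because it is listed first; $^ would also include src/countwords.py.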
Generic rules
Create pattern rule using wildcard: %
results/%.csv : data/%.txt $(COUNT)
	$(RUN_COUNT) $< > $@
So full Makefile is:
Makefile
.PHONY : all clean
COUNT=src/countwords.py
RUN_COUNT=python $(COUNT)
# regenerate all results
all : results/moby_dick.csv results/jane_eyre.csv \
results/time_machine.csv
# regenerate results for any book
results/%.csv : data/%.txt $(COUNT)
	$(RUN_COUNT) $< > $@
# remove all generated files
clean :
	rm -f results/*.csv
Define sets of files
Use variable to list all results files present:
RESULTS=results/*.csv
all : $(RESULTS)
But this only works if the results already exist. Instead, build the list from the files in the data/ directory:
DATA=$(wildcard data/*.txt)
Use pattern substitution to create corresponding output files:
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))
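To see what the pattern substitution produces, here is a small Python sketch of the same idea. The function below is a hypothetical stand-in for Make's patsubst that assumes the pattern contains exactly one %, and the file list is hard-coded for illustration rather than read from the data/ directory.

```python
def patsubst(pattern, replacement, names):
    """Rewrite each name matching pattern; % stands for the shared stem."""
    prefix, _, suffix = pattern.partition("%")
    result = []
    for name in names:
        fits = (
            name.startswith(prefix)
            and name.endswith(suffix)
            and len(name) >= len(prefix) + len(suffix)
        )
        if fits:
            stem = name[len(prefix):len(name) - len(suffix)]
            result.append(replacement.replace("%", stem, 1))
        else:
            result.append(name)  # names that do not match pass through unchanged
    return result


# Assumed contents of the data/ directory, for illustration only.
data = ["data/jane_eyre.txt", "data/moby_dick.txt"]
print(patsubst("data/%.txt", "results/%.csv", data))
# prints: ['results/jane_eyre.csv', 'results/moby_dick.csv']
```

Because the list is derived from the input files, a new book dropped into data/ gets a matching results file without editing the Makefile.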
settings target
Use a settings target to print variables, using @ to avoid repeating the command in the output:
# ... rest of Makefile
# show variables' values
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)
Further streamlining
Remove the RUN_COUNT variable:
# regenerate results for any book
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@
Since all depends on $(RESULTS), we can regenerate everything from scratch with:
make clean
make
Documenting a Makefile
Create a phony target help to print the available commands:
.PHONY: all clean help settings
# ... other definitions ...
# show help
help :
	@echo "all : regenerate all results."
	@echo "results/*.csv : regenerate result for any book."
	@echo "clean : remove all generated files."
	@echo "settings : show variables' values."
	@echo "help : show this message."
Problem with this? It requires manual updates.
“Auto”-documenting a Makefile
Use ## to mark the lines to display and grep to pull them out:
Makefile
.PHONY: all clean help settings
COUNT=src/countwords.py
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))
## all : regenerate all results.
all : $(RESULTS)
## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@
## clean : remove all generated files.
clean :
	rm -f $(RESULTS)
## settings : show variables' values.
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)
## help : show this message.
help :
	@grep '^##' ./Makefile
Other uses for Make?
- Use Make to automate analyses
- You could also include building a LaTeX document
Example with LaTeX
Makefile
.PHONY: all paper clean help settings
COUNT=src/countwords.py
DATA=$(wildcard data/*.txt)
RESULTS=$(patsubst data/%.txt,results/%.csv,$(DATA))
## all : regenerate paper and all results.
all : paper.pdf $(RESULTS)
## results/%.csv : regenerate result for any book.
results/%.csv : data/%.txt $(COUNT)
	python $(COUNT) $< > $@
## paper.pdf : regenerate paper.
paper.pdf : paper.tex paper.bib $(RESULTS)
	latexmk -pdf $<
## clean : remove all generated files.
clean :
	rm -f $(RESULTS)
	latexmk -c
## settings : show variables' values.
settings :
	@echo COUNT: $(COUNT)
	@echo DATA: $(DATA)
	@echo RESULTS: $(RESULTS)
## help : show this message.
help :
	@grep '^##' ./Makefile