Intro to files and command-line inputs in Python
Working with files in Python, including HDF5
Situations for working with files:
- Collaborator emails you raw data, want to look at the results
- You want to email a collaborator data from Python
- You need to use external code that takes input or data file (potentially 100s/1000s of times); want to automate generation of input files from data
- An external program writes out one or more results files, and you want to read and perform analysis
- You want to keep an intermediate calculation for debugging or validation
Saving or loading data: file handle object
f = open('data.txt')
This does:
- Make sure data.txt exists
- Create new handle to this file
- Set cursor position pos to start of the file, pos = 0
This does not read any of the file into memory, write anything to the file, or close the file.
File handle methods
- f.read(n=-1): Reads in n characters (bytes in binary mode); if n=-1 or not present, reads entire rest of file.
- f.readline(): Reads next full line, returns string with newline character.
- f.readlines(): Reads entire rest of file, returns list of strings (with newlines).
- f.seek(pos): Moves file cursor to specified position.
- f.tell(): Returns current position in file.
- f.write(s): Writes string s at current position (overwriting any existing content there).
- f.flush(): Performs all pending write operations.
- f.close(): Closes file (no more reading or writing).
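To see how these methods move the cursor, here is a minimal sketch. The scratch file and its contents are made up for illustration, and it is written to a temporary directory so nothing in the working directory is touched.

```python
import os
import tempfile

# Write a small scratch file so the example is self-contained
# (the file name and contents are hypothetical)
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\nthird line\n')

with open(path) as f:
    start = f.tell()          # cursor starts at position 0
    first = f.readline()      # reads up to and including the first newline
    after_first = f.tell()    # cursor has moved past the first line
    rest = f.readlines()      # remaining lines, as a list of strings
    f.seek(0)                 # move the cursor back to the start
    head = f.read(5)          # read only the first five characters

print(start, repr(first), after_first, rest, repr(head))
```

Note that readlines() only returns what is left after the cursor, which is why the first line does not appear in rest.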
Example: read in matrix
matrix.txt contents:
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6
f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()
Make sure to close the file!
How to do this better?
Use a context manager (with) to close the file automatically:
import numpy as np
with open('matrix.txt', 'r') as f:
    lines = f.readlines()
matrix = np.array([[int(x) for x in line.split(',')]
                   for line in lines])
… actually, this example can be even easier:
import numpy as np
matrix = np.genfromtxt('matrix.txt', delimiter=',')
File modes
When opening a file, by default it opens in read-only mode. Other file modes:
- 'r': Read-only, no writing. Starts at pos = 0.
- 'w': Write. Creates file if it doesn’t exist (if it does, contents are deleted). Starts at pos = 0.
- 'a': Append. Opens file for writing but does not delete contents; creates file if it doesn’t exist. Starting pos is end of file.
- '+': Update. Combined with another mode (e.g. 'r+') to open for both reading and writing. 'r+' does not delete current contents and starts at pos = 0.
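The differences between the modes can be checked with a short sketch like the one below. The file name and contents are hypothetical, and the file lives in a temporary directory.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:     # 'w' creates the file (or empties an existing one)
    f.write('abc\n')

with open(path, 'a') as f:     # 'a' keeps the contents and writes at the end
    f.write('def\n')

with open(path, 'r+') as f:    # 'r+' reads and writes, starting at pos = 0
    f.write('X')               # overwrites the first character in place

with open(path) as f:          # default mode is 'r'
    contents = f.read()

print(repr(contents))
```

The 'r+' write replaces the leading 'a' rather than pushing the existing text to the right, so the file length is unchanged.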
More-complicated example: read in matrix, then add a row before and after it
with open('matrix.txt', 'r+') as f:
    orig = f.read()
    f.seek(0)
    f.write('0,0,0,0\n')
    f.write(orig)
    f.write('\n1,1,1,1')
Quick intro to HDF5
HDF: Hierarchical data format
Basic idea: better to store structured, numerical data in binary formats than in plain-text ASCII files. Why? Binary is usually smaller, especially for floats and larger integers:
# small ints            # medium ints
42     (4 bytes)        123456   (4 bytes)
'42'   (2 bytes)        '123456' (6 bytes)
# near-int floats       # e-notation floats
12.34   (8 bytes)       42.424242E+42   (8 bytes)
'12.34' (5 bytes)       '42.424242E+42' (13 bytes)
Also, faster I/O, since binary files save in native format.
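The byte counts above can be reproduced with the standard-library struct module. This is only a sketch: it assumes 4-byte integers ('<i') and 8-byte floats ('<d'), and takes the text forms directly from the comparison above.

```python
import struct

# (struct format, numeric value, text representation)
# '<i' is a 4-byte signed integer, '<d' an 8-byte float (standard sizes)
examples = [
    ('<i', 42, '42'),
    ('<i', 123456, '123456'),
    ('<d', 12.34, '12.34'),
    ('<d', 42.424242e42, '42.424242E+42'),
]

sizes = []
for fmt, value, text in examples:
    binary_size = len(struct.pack(fmt, value))   # bytes when stored in binary
    text_size = len(text.encode('ascii'))        # bytes when stored as ASCII text
    sizes.append((text, binary_size, text_size))
    print('{:>15}: binary {} bytes, text {} bytes'.format(text, binary_size, text_size))
```

The binary size is fixed by the type, while the text size grows with the number of digits written out.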
HDF5: Hierarchical Data Format, version 5
HDF5 files (.hdf5, .h5) store data in binary format.
HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.
HDF5 is a filesystem in a file: provides nested tree structure for datasets.
Using PyTables for HDF5
import tables as tb
PyTables provides five basic dataset classes:
- Array: homogeneous components, fixed size; great for numerical data
- CArray: chunked arrays
- EArray: extendable arrays
- VLArray: variable-length arrays
- Table: structured collection of records whose values are stored in fixed-length fields
Constructs need to be composed of atomic types:
- bool: true or false type
- int: signed integer types
- uint: unsigned integer types
- float: floating-point types
- complex: complex floating-point types
- string: fixed-length raw string type
Also: Groups, links, and hidden nodes
Getting started with HDF5 files
import tables as tb
h5file = tb.open_file('/path/to/file', 'a')
...
… or, better yet:
import tables as tb
with tb.open_file('/path/to/file', 'a') as h5file:
    ...
HDF5 file modes
- 'r': Read-only; no data can be modified.
- 'w': Write; create a new file (delete existing file with that name).
- 'a': Append; open existing file for reading or writing, or create new file.
- 'r+': Similar to 'a', but file must exist.
HDF5 basics
HDF5 files: all nodes stem from root node: / or h5file.root
Natural naming: can access subnodes as attributes of “parent” nodes, like h5file.root.a_group.some_data (all relevant nodes in the tree must have names that are valid Python variable names)
Creating nodes, arrays, and tables
Create and access a group on the root node:
h5file.create_group(h5file.root, 'a_group', "My Group")
h5file.root.a_group
Datasets: arrays and tables, each with a corresponding create method:
- Arrays are fixed size, must be created with data.
- Tables have a set data type but are variable length, so we can append to them after creation.
Creating nodes, arrays, and tables
# integer array
h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])
# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
h5file.create_table(h5file.root, 'knights', dt)
h5file.root.knights.append(knights)
Hierarchy at this point:
/
|-- a_group/
|   |-- arthur_count
|
|-- knights
Arrays & tables preserve original flavor
h5file.root.a_group.arthur_count[:]
type(h5file.root.a_group.arthur_count[:])
type(h5file.root.a_group.arthur_count)
h5file.root.knights[1]  # pull out second row
h5file.root.knights[:1]  # slice the first row
mask = (h5file.root.knights.cols.id[:] < 28)  # create mask and apply to table
h5file.root.knights[mask]
# Fancy index the second and first rows, in that order
h5file.root.knights[([1, 0],)]
Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically
Can create rows and append to tables
import tables

table_def = {'time': tables.Float64Col(pos=0),
             'temperature': tables.Float64Col(pos=1),
             'pressure': tables.Float64Col(pos=2),
             }
with tables.open_file(filename, mode='w', title="Table title") as h5file:
    table = h5file.create_table(where=h5file.root,
                                name='simulation',
                                description=table_def
                                )
    # sim is an object representing a simulation
    timestep = table.row
    # Save initial conditions
    timestep['time'] = sim.time
    timestep['temperature'] = sim.temperature
    timestep['pressure'] = sim.pressure
    timestep.append()
    # Main time integration loop
    while sim.time < time_end:
        # perform a single integration step
        sim.step()
        # Save new timestep information
        timestep['time'] = sim.time
        timestep['temperature'] = sim.temperature
        timestep['pressure'] = sim.pressure
        # Add ``timestep`` to table
        timestep.append()
    table.flush()
Accessing this table
with tables.open_file(filename, 'r') as h5file:
    # Load the table node named 'simulation'
    table = h5file.root.simulation
    time = table.col('time')
    pressure = table.col('pressure')
    temperature = table.col('temperature')
These are NumPy arrays!
Hierarchy layout
# particles: id, kind, velocity
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ...
             ]
If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?
neutral = [(45, 'neutron', 0.39),
           (46, 'neutron', 0.72),
           (47, 'neutron', 0.55),
           (49, 'neutron', 0.23),
           ...
           ]
charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]
But now kind is redundant in the neutral table. Let’s delete this column and rely on the structure of the tables together to dictate which table refers to what.
neutral = [(45, 0.39),
           (46, 0.72),
           (47, 0.55),
           (49, 0.23),
           ...
           ]
charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]
We are embedding information directly into the semantics of the hierarchy.
We could add another layer that distinguishes the particles based on detector:
/
|-- detector1/
|   |-- neutral
|   |-- charged
|
|-- detector2/
|   |-- neutral
|   |-- charged
Data should be broken up like this to improve access times. This is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in the description (which decreases the size of each row).
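The split itself can be sketched in plain Python before any tables are written. This assumes neutrons are the only neutral kind in the data; NEUTRAL_KINDS is a hypothetical name introduced for the example.

```python
# particles: id, kind, velocity (the rows listed above)
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ]

# Assumption: only neutrons count as neutral in this data set
NEUTRAL_KINDS = {'neutron'}

# Neutral rows drop the redundant kind column; charged rows keep it
neutral = [(pid, velocity) for pid, kind, velocity in particles
           if kind in NEUTRAL_KINDS]
charged = [(pid, kind, velocity) for pid, kind, velocity in particles
           if kind not in NEUTRAL_KINDS]

print(neutral)
print(charged)
```

Each list would then be stored as its own table, so a query about neutral particles never touches the charged rows.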
More complicated topics
- Chunking, in-core and out-of-core operations, querying, compression.
- Another binary format: NetCDF, a flat (meaning non-hierarchical), array-oriented format also used in scientific computing. netCDF-4 builds on HDF5, though.
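As a small taste of querying and compression, here is a sketch using PyTables. The table contents, the threshold, and the compression settings are made up for illustration, and it needs PyTables installed.

```python
import os
import tempfile

import tables as tb

# Hypothetical two-column table, similar to the simulation table above
description = {'time': tb.Float64Col(pos=0),
               'temperature': tb.Float64Col(pos=1)}

path = os.path.join(tempfile.mkdtemp(), 'query.h5')

# Compression is requested through a Filters object (zlib, level 5 here)
filters = tb.Filters(complevel=5, complib='zlib')

with tb.open_file(path, mode='w') as h5file:
    table = h5file.create_table(h5file.root, 'simulation',
                                description, filters=filters)
    table.append([(0.0, 300.0), (1.0, 350.0), (2.0, 420.0)])
    table.flush()

    # The condition string is evaluated by PyTables; only matching
    # rows are returned, as a NumPy structured array
    hot = table.read_where('temperature > 340.0')
    hot_times = hot['time'].tolist()

print(hot_times)
```

For tables too large for memory, this kind of in-kernel query avoids loading every row into Python first.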
Creating command-line interfaces for Python programs with argparse
Command-line interfaces
Rather than only calling your package from other Python code, you can (fairly easily) create a command-line interface.
myprogram --help
myprogram --input inputfile.txt --output output.h5
You can and should also use these to change inputs or settings to your programs without modifying code.
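A parser for the myprogram call shown above might look like the following sketch. The help strings and the default output name are assumptions; passing a list to parse_args stands in for typing the arguments at the shell.

```python
import argparse


def build_parser():
    """Build a parser for the hypothetical myprogram command."""
    parser = argparse.ArgumentParser(
        prog='myprogram',
        description='Read an input file and write results to an output file.'
        )
    parser.add_argument('--input', required=True,
                        help='path to the input file'
                        )
    parser.add_argument('--output', default='output.h5',
                        help='path to the output file (default is an assumption)'
                        )
    return parser


# Equivalent to: myprogram --input inputfile.txt --output output.h5
args = build_parser().parse_args(
    ['--input', 'inputfile.txt', '--output', 'output.h5']
    )
print(args.input, args.output)
```

Keeping the parser in its own function makes it easy to exercise from tests without touching sys.argv.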
Simple example
# contents of example.py
# import the necessary packages
import sys
import argparse
# construct the argument parser and parse the arguments
parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
    )
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )
args = parser.parse_args(sys.argv[1:])
# display a friendly message to the user
print("Hi there {}, it's nice to meet you!".format(args.name))
$ python example.py --help
$ python example.py --name Kyle
Optional arguments
...
parser.add_argument('-c', '--count', action='store_true',
                    default=False,
                    help='Count number of characters in name'
                    )
args = parser.parse_args(sys.argv[1:])
print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
$ python example.py --help
$ python example.py --name Kyle -c
Optional arguments
...
parser.add_argument('-a', '--age',
                    default=25,
                    type=int,
                    help='Age of person'
                    )
args = parser.parse_args(sys.argv[1:])
print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
print("Age: {}".format(args.age))
$ python example.py --help
$ python example.py --name Kyle --age 31
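To check how the three arguments behave together without going to the shell, the parser can be fed explicit argument lists. This sketch rebuilds the same parser as above; the variable names given and defaults are just for illustration.

```python
import argparse

parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
    )
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )
parser.add_argument('-c', '--count', action='store_true',
                    help='Count number of characters in name'
                    )
parser.add_argument('-a', '--age', default=25, type=int,
                    help='Age of person'
                    )

# Same as: python example.py --name Kyle --age 31
given = parser.parse_args(['--name', 'Kyle', '--age', '31'])
# Same as: python example.py -n Kyle
defaults = parser.parse_args(['-n', 'Kyle'])

# type=int converts the string '31' to an integer;
# omitted options fall back to their defaults
print(given.age, type(given.age).__name__, given.count)
print(defaults.age, defaults.count)
```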
More complex input
For larger/more complex input, may be better to use an input file.
YAML is a good option for this: use pyyaml (import yaml)
Example YAML input file
input.yaml:
case: square cavity
length: 1.0
resolution: 0.01
velocities:
- 1.0
- 10.0
- 100.0
import yaml
with open('input.yaml', 'r') as f:
    inputs = yaml.safe_load(f)
Returns a dictionary
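Once loaded, the inputs are ordinary Python objects. Below is a sketch of how they might be checked and used; the dictionary literal stands in for the result of yaml.safe_load on the file above, and validate_inputs and num_cells are hypothetical names.

```python
# Stand-in for the dictionary that yaml.safe_load() returns
# for the input.yaml shown above
inputs = {
    'case': 'square cavity',
    'length': 1.0,
    'resolution': 0.01,
    'velocities': [1.0, 10.0, 100.0],
}


def validate_inputs(inputs):
    """Raise KeyError if an expected key is missing (hypothetical helper)."""
    required = ('case', 'length', 'resolution', 'velocities')
    missing = [key for key in required if key not in inputs]
    if missing:
        raise KeyError('missing input keys: {}'.format(', '.join(missing)))
    return inputs


validate_inputs(inputs)

# Scalars come back as floats and the YAML list as a Python list
num_cells = round(inputs['length'] / inputs['resolution'])
for velocity in inputs['velocities']:
    print('{}: velocity = {}, cells = {}'.format(
        inputs['case'], velocity, num_cells))
```

Checking for required keys right after loading gives a clear error early, rather than a failure deep inside a long calculation.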
Takehome messages
- If you need to store numerical data, use HDF5 instead of plaintext.
- Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.