Intro to files and command-line inputs in Python
Working with files in Python, including HDF5
Situations for working with files:
- A collaborator emails you raw data, and you want to look at the results
- You want to email a collaborator data from Python
- You need to use external code that takes an input or data file (potentially hundreds or thousands of times); you want to automate generating the input files from data
- An external program writes out one or more results files, and you want to read and perform analysis
- You want to keep an intermediate calculation for debugging or validation
Saving or loading data: file handle object
f = open('data.txt')
This does:
- Make sure data.txt exists
- Create a new handle to this file
- Set the cursor position pos to the start of the file, pos = 0
This does not read any of the file into memory, write anything to the file, or close the file.
File handle methods
- f.read(n=-1): Reads in n bytes; if n=-1 or not present, read entire rest of file.
- f.readline(): Read next full line, return string with newline character.
- f.readlines(): Reads entire rest of file, returns list of strings (with newlines).
- f.seek(pos): Move file cursor to specified position.
- f.tell(): Return current position in file.
- f.write(s): Insert string s at current position.
- f.flush(): Perform all pending write operations.
- f.close(): Close file (no more reading or writing).
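A minimal sketch showing a few of these in action (assuming the data.txt from above exists):
f = open('data.txt')
first = f.readline()   # read one line, newline included
print(f.tell())        # cursor is now just past that line
f.seek(0)              # move the cursor back to the start
everything = f.read()  # read the rest of the file from the cursor onward
f.close()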
Example: read in matrix
matrix.txt contents:
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6
f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()
Make sure to close the file!
How to do this better?
Use context manager & with to automatically close:
import numpy as np
with open('matrix.txt', 'r') as f:
    lines = f.readlines()
matrix = np.array([[int(x) for x in line.split(',')]
                   for line in lines]
                  )
… actually, this example can be even easier:
import numpy as np
matrix = np.genfromtxt('matrix.txt', delimiter=',')
File modes
When opening a file, by default it opens in read-only mode. Other file modes:
- 'r': Read-only, no writing. Starts at pos = 0.
- 'w': Write. Creates file if it doesn’t exist (if it does, contents are deleted). Starts at pos = 0.
- 'a': Append. Opens file for writing but does not delete contents; creates file if it doesn’t exist. Starting pos is end of file.
- '+': Update (combined with another mode, e.g. 'r+'). Opens for reading and writing, does not delete current contents. Starts at pos = 0.
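As a quick illustration of 'w' versus 'a' (a small sketch; log.txt is just a placeholder file name):
with open('log.txt', 'w') as f:   # 'w' creates the file or wipes existing contents
    f.write('first line\n')
with open('log.txt', 'a') as f:   # 'a' keeps the contents and writes at the end
    f.write('second line\n')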
More-complicated example: read in matrix and add
with open('matrix.txt', 'r+') as f:
    orig = f.read()
    f.seek(0)
    f.write('0,0,0,0\n')
    f.write(orig)
    f.write('\n1,1,1,1')
Quick intro to HDF5
HDF: Hierarchical data format
Basic idea: better to store structured, numerical data in binary formats over plain-text ASCII files. Why? Smaller.
# small ints              # medium ints
42      (4 bytes)         123456          (4 bytes)
'42'    (2 bytes)         '123456'        (6 bytes)
# near-int floats         # e-notation floats
12.34   (8 bytes)         42.424242E+42   (8 bytes)
'12.34' (5 bytes)         '42.424242E+42' (13 bytes)
Binary files also give faster I/O, since data is stored in its native in-memory format (no parsing or string conversion).
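A rough way to see the size difference with NumPy (a sketch; the exact text length depends on formatting):
import numpy as np
data = np.linspace(0.0, 1.0, 1000)          # 1000 64-bit floats
print(data.nbytes)                          # 8000 bytes in binary form
print(len(','.join(str(x) for x in data)))  # several times larger as text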
HDF5: Hierarchical Data Format, version 5
HDF5 files (.hdf5, .h5) store data in binary format.
HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.
HDF5 is a filesystem in a file: provides nested tree structure for datasets.
Using PyTables for HDF5
import tables as tb
PyTables provides five basic dataset classes:
- Array: homogeneous components, fixed size; great for numerical data
- CArray: chunked arrays
- EArray: extendable arrays
- VLArray: variable-length arrays
- Table: structured collection of records whose values are stored in fixed-length fields
Constructs need to be composed of atomic types:
- bool: true or false type
- int: signed integer types
- uint: unsigned integer types
- float: floating-point types
- complex: complex floating-point types
- string: fixed-length raw string type
Also: Groups, links, and hidden nodes
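For example, an extendable array is declared with an atom for its element type (a minimal sketch; example.h5 and the node name are placeholders):
import numpy as np
import tables as tb
with tb.open_file('example.h5', 'w') as h5file:
    # EArray: extendable along the dimension declared with length 0
    earr = h5file.create_earray(h5file.root, 'measurements',
                                atom=tb.Float64Atom(), shape=(0, 3))
    earr.append(np.random.rand(10, 3))  # add 10 rows of 3 floats
    earr.append(np.random.rand(5, 3))   # extend with 5 more rows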
Getting started with HDF5 files
import tables as tb
h5file = tb.open_file('/path/to/file', 'a')
...
… or, better yet:
import tables as tb
with tb.open_file('/path/to/file', 'a') as h5file:
...
HDF5 file modes
- 'r': Read-only; no data can be modified.
- 'w': Write; create a new file (delete existing file with that name).
- 'a': Append; open existing file for reading or writing, or create new file.
- 'r+': Similar to 'a', but the file must already exist.
HDF5 basics
HDF5 files: all nodes stem from root node: / or h5file.root
Natural naming: can access subnodes as attributes of “parent” nodes, like h5file.root.a_group.some_data
(all relevant nodes in the tree must have names that are valid Python variable names)
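The same node can also be reached by its path string (a one-line sketch using the placeholder names above):
h5file.get_node('/a_group/some_data')  # equivalent to h5file.root.a_group.some_data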
Creating nodes, arrays, and tables
Create and access a group on the root node:
h5file.create_group(h5file.root, 'a_group', "My Group")
h5file.root.a_group
Datasets: arrays and tables, each with a corresponding create method:
- Arrays are fixed size, must be created with data.
- Tables have a set data type but are variable length, so we can append to them after creation.
Creating nodes, arrays, and tables
# integer array
h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])
# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
h5file.create_table(h5file.root, 'knights', dt)
h5file.root.knights.append(knights)
Hierarchy at this point:
/
|-- a_group/
| |-- arthur_count
|
|-- knights
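A quick way to inspect this tree from Python (a sketch; printing the open file shows its object tree):
print(h5file)                        # summary of the file's object tree
for node in h5file.walk_nodes('/'):  # iterate over every node in the file
    print(node)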
Arrays & tables preserve original flavor
h5file.root.a_group.arthur_count[:]
type(h5file.root.a_group.arthur_count[:])
type(h5file.root.a_group.arthur_count)
h5file.root.knights[1] # pull out second row
h5file.root.knights[:1] # slice the first row
mask = (h5file.root.knights.cols.id[:] < 28)
h5file.root.knights[mask] # create mask and apply to table
# Fancy index the second and first rows, in that order
h5file.root.knights[([1, 0],)]
Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically
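Instead of building a boolean mask in memory as above, the condition can also be handed to PyTables directly (a sketch using Table.read_where on the knights table):
h5file.root.knights.read_where('id < 28')  # matching rows as a NumPy structured array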
Can create rows and append to tables
import tables
table_def = {'time': tables.Float64Col(pos=0),
             'temperature': tables.Float64Col(pos=1),
             'pressure': tables.Float64Col(pos=2),
             }
with tables.open_file(filename, mode='w', title="Table title") as h5file:
    table = h5file.create_table(where=h5file.root,
                                name='simulation',
                                description=table_def
                                )
    # sim is an object representing a simulation
    timestep = table.row
    # Save initial conditions
    timestep['time'] = sim.time
    timestep['temperature'] = sim.temperature
    timestep['pressure'] = sim.pressure
    timestep.append()
    # Main time integration loop
    while sim.time < time_end:
        sim.step()  # perform a single integration step
        # Save new timestep information
        timestep['time'] = sim.time
        timestep['temperature'] = sim.temperature
        timestep['pressure'] = sim.pressure
        # Add ``timestep`` to table
        timestep.append()
    table.flush()
Accessing this table
with tables.open_file(filename, 'r') as h5file:
    # Access the Table named 'simulation' under the root group
    table = h5file.root.simulation
    time = table.col('time')
    pressure = table.col('pressure')
    temperature = table.col('temperature')
These are NumPy arrays!
Hierarchy layout
# particles: id, kind, velocity
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ...
             ]
If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?
neutral = [(45, 'neutron', 0.39),
           (46, 'neutron', 0.72),
           (47, 'neutron', 0.55),
           (49, 'neutron', 0.23),
           ...
           ]
charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]
But now kind is redundant in the neutral table. Let’s delete this column and rely on the structure of the tables together to dictate which table refers to what.
neutral = [(45, 0.39),
           (46, 0.72),
           (47, 0.55),
           (49, 0.23),
           ...
           ]
charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]
We are embedding information directly into the semantics of the hierarchy.
We could add another layer that distinguishes the particles based on detector:
/
|-- detector1/
| |-- neutral
| |-- charged
|
|-- detector2/
| |-- neutral
| |-- charged
Data should be broken up like this to improve access speed. This is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in the description (which decreases the size of each row).
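One way to build this layout (a sketch; the file name and dtypes are illustrative, with 'kind' kept only for the charged tables):
import numpy as np
import tables as tb
charged_dt = np.dtype([('id', int), ('kind', 'S10'), ('velocity', float)])
neutral_dt = np.dtype([('id', int), ('velocity', float)])
with tb.open_file('particles.h5', 'w') as h5file:
    for det in ('detector1', 'detector2'):
        group = h5file.create_group(h5file.root, det)
        h5file.create_table(group, 'charged', charged_dt)
        h5file.create_table(group, 'neutral', neutral_dt)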
More complicated topics
- Chunking, in-core and out-of-core operations, querying, compression.
- Other binary formats: NetCDF, a flat (meaning non-hierarchical) array-oriented format also used in scientific computing; netCDF-4 builds on HDF5, though.
Creating command-line interfaces for Python programs with argparse
Command-line interfaces
Rather than only calling your package from other Python code, you can (fairly easily) create a command-line interface.
myprogram --help
myprogram --input inputfile.txt --output output.h5
You can and should also use these to change inputs or settings to your programs without modifying code.
Simple example
# contents of example.py
# import the necessary packages
import sys
import argparse
# construct the argument parser and parse the arguments
parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
)
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )
args = parser.parse_args(sys.argv[1:])
# display a friendly message to the user
print("Hi there {}, it's nice to meet you!".format(args.name))
$ python example.py --help
$ python example.py --name Kyle
Optional arguments
...
parser.add_argument('-c', '--count', action='store_true',
                    default=False,
                    help='Count number of characters in name'
                    )
args = parser.parse_args(sys.argv[1:])
print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
$ python example.py --help
$ python example.py --name Kyle -c
Optional arguments
...
parser.add_argument('-a', '--age',
                    default=25,
                    type=int,
                    help='Age of person'
                    )
args = parser.parse_args(sys.argv[1:])
print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
print("Age: {}".format(args.age))
$ python example.py --help
$ python example.py --name Kyle --age 31
More complex input
For larger or more complex input, it may be better to use an input file.
YAML is a good option for this: use pyyaml (import yaml)
Example YAML input file
input.yaml:
case: square cavity
length: 1.0
resolution: 0.01
velocities:
- 1.0
- 10.0
- 100.0
import yaml
with open('input.yaml', 'r') as f:
    inputs = yaml.safe_load(f)
Returns a dictionary
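The returned dictionary can then drive the rest of the program (a sketch using the keys from input.yaml above; run_simulation is a hypothetical stand-in for your own code):
print(inputs['case'])  # 'square cavity'
for velocity in inputs['velocities']:
    # run_simulation is hypothetical, standing in for your own code
    run_simulation(inputs['length'], inputs['resolution'], velocity)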
Takehome messages
- If you need to store numerical data, use HDF5 instead of plain text.
- Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.