Intro to files and command-line inputs in Python

Author: Kyle Niemeyer

Affiliation: Oregon State University

Published: January 22, 2025

Working with files in Python, including HDF5

Situations for working with files:

  • A collaborator emails you raw data, and you want to look at the results
  • You want to email a collaborator data from Python
  • You need to use external code that takes an input or data file (potentially 100s/1000s of times), and you want to automate generation of input files from data
  • An external program writes out one or more results files, and you want to read and perform analysis
  • You want to keep an intermediate calculation for debugging or validation

Saving or loading data: file handle object

f = open('data.txt')

This does:

  • Make sure data.txt exists (raises FileNotFoundError if it does not)
  • Create new handle to this file
  • Set cursor position pos to start of the file, pos = 0

This does not read any of the file into memory, write anything to the file, or close the file.

File handle methods

  • f.read(n=-1): Read n characters (n bytes in binary mode); if n is -1 or omitted, read the entire rest of the file.
  • f.readline(): Read next full line, return string with newline character.
  • f.readlines(): Reads entire rest of file, returns list of strings (with newlines).
  • f.seek(pos): Move file cursor to specified position.
  • f.tell(): Return current position in file.
  • f.write(s): Write string s at current position, overwriting any existing content there (it does not insert).
  • f.flush(): Perform all pending write operations.
  • f.close(): Close file (no more reading or writing).
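The cursor behavior behind these methods can be seen in a short sketch. The file name and contents here are made up for illustration, and a temporary directory is used so nothing in the working directory is touched.

```python
import os
import tempfile

# Create a small sample file (hypothetical contents, for illustration only)
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
out = open(path, 'w')
out.write('first line\nsecond line\nthird line\n')
out.close()

f = open(path)            # read-only by default; nothing is read yet
start = f.tell()          # cursor starts at the beginning of the file
first = f.readline()      # reads up to and including the first newline
mark = f.tell()           # remember where the second line begins
chunk = f.read(6)         # reads the next 6 characters
rest = f.readlines()      # everything that is left, as a list of lines
f.seek(mark)              # jump back to the remembered position
again = f.readline()      # the full second line this time
f.close()

print(repr(first), repr(chunk), rest, repr(again))
```

Note that each read picks up where the previous one stopped, which is why `rest` begins partway through the second line.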

Example: read in matrix

matrix.txt contents:

1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6

f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()

Make sure to close the file!
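One manual way to guarantee the file is closed, even if parsing fails partway through, is try/finally. This is a minimal sketch; the temporary file and its deliberately malformed second row are made up for illustration.

```python
import os
import tempfile

# Sample file with a bad value ('oops') in the second row; illustrative only
path = os.path.join(tempfile.mkdtemp(), 'matrix.txt')
out = open(path, 'w')
out.write('1,4,15,9\n0,11,oops,3\n')
out.close()

f = open(path)
matrix = []
error = None
try:
    for line in f.readlines():
        matrix.append([int(x) for x in line.split(',')])
except ValueError as exc:
    error = exc           # int('oops') fails on the second row
finally:
    f.close()             # runs whether or not an exception occurred

print(matrix, f.closed, error)
```

This works, but it is easy to forget and adds boilerplate to every file access.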

How to do this better?

Use context manager & with to automatically close:

import numpy as np
with open('matrix.txt', 'r') as f:
    lines = f.readlines()

matrix = np.array([[int(x) for x in line.split(',')]
                  for line in lines]
                  )

… actually, this example can be even easier:

import numpy as np
# genfromtxt returns floats by default; pass dtype=int to get integers
matrix = np.genfromtxt('matrix.txt', delimiter=',')

File modes

When opening a file, by default it opens in read-only mode. Other file modes:

  • 'r': Read-only, no writing. Starts at pos = 0.
  • 'w': Write. Creates file if it doesn’t exist (if it does, contents are deleted). Starts at pos = 0.
  • 'a': Append. Opens file for writing but does not delete contents; creates file if it doesn’t exist. Starting pos is end of file.
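  • '+': Update. Added to another mode (for example 'r+') to open for both reading and writing. 'r+' does not delete current contents and starts at pos = 0; 'w+' still deletes contents, and 'a+' still starts at the end of the file.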
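A short sketch of how these modes differ in practice, using a throwaway file in a temporary directory (the file name and contents are illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:       # 'w' creates the file
    f.write('abcdef\n')

with open(path, 'a') as f:       # 'a' starts at the end, keeps contents
    f.write('ghi\n')
with open(path) as f:            # default mode is 'r'
    after_append = f.read()

with open(path, 'r+') as f:      # 'r+' starts at pos = 0, keeps contents
    f.write('XY')                # overwrites 'ab' rather than inserting
with open(path) as f:
    after_update = f.read()

with open(path, 'w') as f:       # 'w' on an existing file deletes contents
    f.write('new\n')
with open(path) as f:
    after_rewrite = f.read()

print(repr(after_append), repr(after_update), repr(after_rewrite))
```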

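More-complicated example: read in matrix and add rows

with open('matrix.txt', 'r+') as f:
    orig = f.read()           # read entire file; cursor is now at the end
    f.seek(0)                 # move cursor back to the start
    f.write('0,0,0,0\n')      # new first row, written from pos = 0
    f.write(orig)             # rewrite the original contents after it
    f.write('\n1,1,1,1')      # new last row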

Quick intro to HDF5

HDF: Hierarchical data format

Basic idea: better to store structured, numerical data in binary formats over plain-text ASCII files. Why? Smaller.

# small ints        # medium ints
42   (4 bytes)      123456   (4 bytes)
'42' (2 bytes)      '123456' (6 bytes)

# near-int floats   # e-notation floats
12.34   (8 bytes)   42.424242E+42   (8 bytes)
'12.34' (5 bytes)   '42.424242E+42' (13 bytes)

Also, faster I/O, since binary files save in native format.
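The byte counts above can be checked with Python's standard struct module. This is only an illustration of binary versus text sizes; it is not meant to show how HDF5 lays out data internally.

```python
import struct

small_int = 42
medium_int = 123456
e_float = 42.424242e42

# Binary: fixed width regardless of the value
binary_small = struct.pack('<i', small_int)     # 32-bit signed integer
binary_medium = struct.pack('<i', medium_int)   # 32-bit signed integer
binary_float = struct.pack('<d', e_float)       # 64-bit float

# Text: one byte per ASCII character
text_small = str(small_int).encode('ascii')
text_medium = str(medium_int).encode('ascii')
text_float = '42.424242E+42'.encode('ascii')

for name, b, t in [('small int', binary_small, text_small),
                   ('medium int', binary_medium, text_medium),
                   ('e-notation float', binary_float, text_float)]:
    print(f'{name}: binary {len(b)} bytes, text {len(t)} bytes')

# Binary storage round-trips the float exactly
restored = struct.unpack('<d', binary_float)[0]
```

Small integers can be shorter as text, but binary sizes are predictable and do not grow with the number of digits.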

HDF5: Hierarchical Data Format, version 5

HDF5 files (.hdf5, .h5) store data in binary format.

HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.

HDF5 is a filesystem in a file: provides nested tree structure for datasets.

Using PyTables for HDF5

import tables as tb

PyTables provides five basic dataset classes:

  • Array: homogeneous components, fixed size; great for numerical data
  • CArray: chunked arrays
  • EArray: extendable arrays
  • VLArray: variable-length arrays
  • Table: structured collection of records whose values are stored in fixed-length fields

Constructs need to be composed of atomic types:

  • bool: true or false type
  • int: signed integer types
  • uint: unsigned integer types
  • float: floating-point types
  • complex: complex floating-point types
  • string: fixed-length raw string type

Also: Groups, links, and hidden nodes

Getting started with HDF5 files

import tables as tb
h5file = tb.open_file('/path/to/file', 'a')
...

… or, better yet:

import tables as tb
with tb.open_file('/path/to/file', 'a') as h5file:
    ...

HDF5 file modes

  • 'r': Read-only; no data can be modified.
  • 'w': Write; create a new file (delete existing file with that name).
  • 'a': Append; open existing file for reading or writing, or create new file.
  • 'r+': Similar to a, but file must exist.

HDF5 basics

HDF5 files: all nodes stem from root node: / or h5file.root

Natural naming: can access subnodes as attributes of “parent” nodes, like h5file.root.a_group.some_data

(all relevant nodes in the tree must have names that are valid Python variable names)
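As a sketch of natural naming, the snippet below builds a tiny file in a temporary directory and reaches the same node in three ways. It assumes PyTables is installed; the file name and the values stored in some_data are placeholders.

```python
import os
import tempfile

import tables as tb

path = os.path.join(tempfile.mkdtemp(), 'naming.h5')

with tb.open_file(path, 'w') as h5file:
    # Build /a_group/some_data (placeholder values)
    group = h5file.create_group(h5file.root, 'a_group', 'My Group')
    h5file.create_array(group, 'some_data', [1, 2, 5, 3])

    # Natural naming: walk the tree with attribute access
    by_attribute = h5file.root.a_group.some_data.read()

    # Equivalent lookups by path string
    by_path = h5file.get_node('/a_group/some_data').read()
    by_parts = h5file.get_node('/a_group', 'some_data').read()

print(by_attribute, by_path, by_parts)
```

The path-based get_node form is the fallback when a node name is not a valid Python identifier.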

Creating nodes, arrays, and tables

Create and access a group on the root node:

h5file.create_group(h5file.root, 'a_group', "My Group")
h5file.root.a_group

Datasets: arrays and tables, each with a corresponding create method:

  • Arrays are fixed size, must be created with data.
  • Tables have a set data type but are variable length, so we can append to them after creation.

Creating nodes, arrays, and tables

import numpy as np

# integer array
h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])

# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
h5file.create_table(h5file.root, 'knights', dt)
h5file.root.knights.append(knights)

Hierarchy at this point:

/
|-- a_group/
|   |-- arthur_count
|
|-- knights

Arrays & tables preserve original flavor

h5file.root.a_group.arthur_count[:]
type(h5file.root.a_group.arthur_count[:])
type(h5file.root.a_group.arthur_count)
h5file.root.knights[1] # pull out second row

h5file.root.knights[:1] # slice the first row

mask = (h5file.root.knights.cols.id[:] < 28)
h5file.root.knights[mask] # create mask and apply to table

# Fancy index the second and first rows, in that order
h5file.root.knights[([1, 0],)]

Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically

Can create rows and append to tables

import tables as tb

table_def = {'time': tb.Float64Col(pos=0),
             'temperature': tb.Float64Col(pos=1),
             'pressure': tb.Float64Col(pos=2),
             }
with tb.open_file(filename, mode='w', title="Table title"
                  ) as h5file:
    table = h5file.create_table(where=h5file.root,
                                name='simulation',
                                description=table_def
                                )
    
    # sim is an object representing a simulation
    
    timestep = table.row
    # Save initial conditions
    timestep['time'] = sim.time
    timestep['temperature'] = sim.temperature
    timestep['pressure'] = sim.pressure
    timestep.append()

    # Main time integration loop
    while sim.time < time_end:
        sim.step() # perform a single integration step

        # Save new timestep information
        timestep['time'] = sim.time
        timestep['temperature'] = sim.temperature
        timestep['pressure'] = sim.pressure

        # Add ``timestep`` to table
        timestep.append()
    table.flush()

Accessing this table

import tables as tb

with tb.open_file(filename, 'r') as h5file:
    # Load the table named "simulation" from the root group
    table = h5file.root.simulation

    time = table.col('time')
    pressure = table.col('pressure')
    temperature = table.col('temperature')

These are NumPy arrays!

Hierarchy layout

# particles:  id, kind,       velocity
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ...
             ]

If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?

neutral = [(45, 'neutron',  0.39),
           (46, 'neutron',  0.72),
           (47, 'neutron',  0.55),
           (49, 'neutron',  0.23),
           ...
           ]

charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]

But now kind is redundant in the neutral table. Let’s delete this column and rely on the structure of the tables together to dictate which table refers to what.

neutral = [(45, 0.39),
           (46, 0.72),
           (47, 0.55),
           (49, 0.23),
           ...
           ]

charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]

We are embedding information directly into the semantics of the hierarchy.

We could add another layer that distinguishes the particles based on detector:

/
|-- detector1/
|   |-- neutral
|   |-- charged
|
|-- detector2/
|   |-- neutral
|   |-- charged

Data should be broken up like this to improve access speed. This is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in the description (which decreases the size of each row).
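A possible PyTables sketch of this layout follows. It assumes PyTables and NumPy are installed, writes to a temporary file, and reuses the sample rows from the lists above for both detectors purely as placeholder data. The column widths (such as 'S8' for kind) are assumptions.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Neutral table drops the redundant 'kind' column
neutral_dt = np.dtype([('id', np.int64), ('velocity', np.float64)])
charged_dt = np.dtype([('id', np.int64), ('kind', 'S8'),
                       ('velocity', np.float64)])

neutral = np.array([(45, 0.39), (46, 0.72), (47, 0.55), (49, 0.23)],
                   dtype=neutral_dt)
charged = np.array([(42, 'electron', 72.0), (43, 'proton', 0.1),
                    (44, 'electron', 76.8), (48, 'proton', 0.18)],
                   dtype=charged_dt)

path = os.path.join(tempfile.mkdtemp(), 'particles.h5')

with tb.open_file(path, 'w') as h5file:
    for detector in ('detector1', 'detector2'):
        group = h5file.create_group(h5file.root, detector)
        h5file.create_table(group, 'neutral', neutral_dt).append(neutral)
        h5file.create_table(group, 'charged', charged_dt).append(charged)

with tb.open_file(path, 'r') as h5file:
    # List every table in the hierarchy
    table_paths = sorted(node._v_pathname
                         for node in h5file.walk_nodes('/', classname='Table'))
    # Query only the neutral particles of one detector
    table = h5file.root.detector1.neutral
    neutral_columns = list(table.colnames)
    fast_neutral = table.read_where('velocity > 0.5')

print(table_paths)
print(fast_neutral)
```

The query touches only one small table with two columns, rather than scanning every particle from every detector.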

More complicated topics

  • Chunking, in-core and out-of-core operations, querying, compression.
  • Other binary formats: NetCDF is a flat (non-hierarchical), array-oriented format also used in scientific computing, although netCDF-4 builds on HDF5.

Creating command-line interfaces for Python programs with argparse

Command-line interfaces

Rather than only calling your package from other Python code, you can (fairly easily) create a command-line interface.

myprogram --help

myprogram --input inputfile.txt --output output.h5

You can and should also use these to change inputs or settings to your programs without modifying code.

Simple example

# contents of example.py
# import the necessary packages
import sys
import argparse

# construct the argument parser and parse the arguments
parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
    )
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )
args = parser.parse_args(sys.argv[1:])

# display a friendly message to the user
print("Hi there {}, it's nice to meet you!".format(args.name))

$ python example.py --help
$ python example.py --name Kyle

Optional arguments

...
parser.add_argument('-c', '--count', action='store_true',
                    default=False,
                    help='Count number of characters in name'
                    )
args = parser.parse_args(sys.argv[1:])

print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))

$ python example.py --help
$ python example.py --name Kyle -c

Optional arguments

...
parser.add_argument('-a', '--age',
                    default=25,
                    type=int,
                    help='Age of person'
                    )
args = parser.parse_args(sys.argv[1:])

print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
print("Age: {}".format(args.age))

$ python example.py --help
$ python example.py --name Kyle --age 31
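Putting the three arguments together, one way to make the interface easy to check is to pass an explicit list to parse_args instead of relying on sys.argv. In this sketch, build_parser is a hypothetical helper name, not something from the slides.

```python
import argparse


def build_parser():
    """Return a parser with the name, count, and age arguments."""
    parser = argparse.ArgumentParser(
        description='This is a simple command-line program.'
        )
    parser.add_argument('-n', '--name', required=True,
                        help='name of the user')
    parser.add_argument('-c', '--count', action='store_true',
                        help='Count number of characters in name')
    parser.add_argument('-a', '--age', default=25, type=int,
                        help='Age of person')
    return parser


parser = build_parser()

# Same as running: python example.py --name Kyle -c --age 31
args = parser.parse_args(['--name', 'Kyle', '-c', '--age', '31'])

# Omitted optional arguments fall back to their defaults
defaults = parser.parse_args(['-n', 'Kyle'])

print(args.name, args.count, args.age)
print(defaults.name, defaults.count, defaults.age)
```

Because type=int is set, args.age arrives as an integer rather than the string typed on the command line.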

More complex input

For larger or more complex input, it may be better to use an input file.

YAML is a good option for this: use pyyaml (import yaml)

Example YAML input file

input.yaml:

case: square cavity
length: 1.0
resolution: 0.01
velocities:
  - 1.0
  - 10.0
  - 100.0

import yaml
with open('input.yaml', 'r') as f:
    inputs = yaml.safe_load(f)

yaml.safe_load returns a dictionary here, because the top level of input.yaml is a mapping.
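As a sketch of working with the result, the dictionary literal below mirrors what loading input.yaml would produce, so the example runs without pyyaml. The check_inputs helper and the num_cells calculation are hypothetical and only illustrate typical use.

```python
# What yaml.safe_load gives for the input.yaml shown above
inputs = {'case': 'square cavity',
          'length': 1.0,
          'resolution': 0.01,
          'velocities': [1.0, 10.0, 100.0],
          }

REQUIRED_KEYS = {'case', 'length', 'resolution', 'velocities'}


def check_inputs(inputs):
    """Raise KeyError if any required key is missing (hypothetical helper)."""
    missing = REQUIRED_KEYS - inputs.keys()
    if missing:
        raise KeyError('missing input keys: {}'.format(sorted(missing)))
    return inputs


check_inputs(inputs)

# YAML scalars and lists arrive as ordinary Python floats, strings, and lists
num_cells = round(inputs['length'] / inputs['resolution'])
for velocity in inputs['velocities']:
    print('{}: velocity {} on {} cells'.format(
        inputs['case'], velocity, num_cells))
```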

Takehome messages

  • If you need to store numerical data, use HDF5 instead of plaintext.
  • Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.