Intro to files and command-line inputs in Python

Author: Kyle Niemeyer

Affiliation: Oregon State University

Published: January 22, 2025

Working with files in Python, including HDF5

Situations for working with files:

A collaborator emails you raw data, and you want to look at the results
You want to email a collaborator data from Python
You need to use external code that takes an input or data file (potentially 100s/1000s of times), and want to automate generating input files from data
An external program writes out one or more results files, and you want to read and analyze them
You want to keep an intermediate calculation for debugging or validation

Saving or loading data: file handle object

f = open('data.txt')

This does:

Checks that data.txt exists (raises FileNotFoundError if not)
Creates a new handle to this file
Sets the cursor position pos to the start of the file, pos = 0

This does not read any of the file into memory, write anything to the file, or close the file.

File handle methods

f.read(n=-1): Reads n characters (bytes in binary mode); if n=-1 or not present, reads the entire rest of the file.
f.readline(): Reads the next full line, returns a string including the newline character.
f.readlines(): Reads the entire rest of the file, returns a list of strings (with newlines).
f.seek(pos): Moves the file cursor to the specified position.
f.tell(): Returns the current position in the file.
f.write(s): Writes string s at the current position, overwriting any existing content there.
f.flush(): Performs all pending write operations.
f.close(): Closes the file (no more reading or writing).
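
A minimal sketch of these methods in action. The file name demo.txt and its contents are made up for illustration, and the file is created in a temporary directory so nothing else is touched:

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for illustration
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('first\nsecond\nthird\n')

f = open(path)
start = f.tell()            # cursor starts at 0
first = f.readline()        # next full line, newline included
after_first = f.tell()      # cursor has advanced past the first line
rest = f.readlines()        # remaining lines as a list of strings
f.seek(0)                   # move the cursor back to the start
chunk = f.read(5)           # read 5 characters from the start
f.close()

print(start, repr(first), rest, repr(chunk), f.closed)
```

Each read advances the cursor, which is why readlines() only returns the lines that readline() had not already consumed.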

Example: read in matrix

matrix.txt contents:

1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6

f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()

Make sure to close the file!

How to do this better?

Use a context manager (the with statement) to close the file automatically:

import numpy as np
with open('matrix.txt', 'r') as f:
    lines = f.readlines()

matrix = np.array([[int(x) for x in line.split(',')]
                  for line in lines]
                  )

… actually, this example can be even easier:

import numpy as np
matrix = np.genfromtxt('matrix.txt', delimiter=',')
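
One difference worth knowing: np.genfromtxt returns floating-point values by default, while the list-comprehension version built integers. A short sketch, assuming the matrix.txt contents shown earlier (recreated here in a temporary directory):

```python
import os
import tempfile

import numpy as np

# Recreate the matrix.txt contents from the example in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'matrix.txt')
with open(path, 'w') as f:
    f.write('1,4,15,9\n0,11,7,3\n2,8,12,13\n14,5,10,6\n')

as_float = np.genfromtxt(path, delimiter=',')            # default dtype is float
as_int = np.genfromtxt(path, delimiter=',', dtype=int)   # request integers

print(as_float.dtype, as_int.dtype, as_int.shape)
```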

File modes

When opening a file, the default mode is read-only ('r'). File modes:

'r': Read-only, no writing. Starts at pos = 0.
'w': Write. Creates file if it doesn’t exist (if it does, contents are deleted). Starts at pos = 0.
'a': Append. Opens file for writing but does not delete contents; creates file if it doesn’t exist. Starting pos is end of file.
'+': Update. Added to another mode to allow both reading and writing. 'r+' does not delete current contents and starts at pos = 0; 'w+' still deletes them.
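
The effect of each mode can be checked on a scratch file. This is a small sketch; the file name modes.txt is hypothetical and lives in a temporary directory:

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:      # creates the file
    f.write('abc\n')
with open(path, 'w') as f:      # reopening with 'w' discards the old contents
    f.write('hello\n')
with open(path, 'a') as f:      # 'a' keeps contents and writes at the end
    f.write('world\n')
with open(path, 'r+') as f:     # 'r+' keeps contents, cursor at the start
    f.write('J')                # overwrites the first character in place

with open(path) as f:           # default mode is 'r'
    contents = f.read()

print(repr(contents))
```

Note that writing in 'r+' mode replaces characters at the cursor rather than inserting before them.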

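More-complicated example: read in matrix and add rows

with open('matrix.txt', 'r+') as f:
    orig = f.read()           # read the whole file; cursor is now at the end
    f.seek(0)                 # move cursor back to the start
    f.write('0,0,0,0\n')      # new first row (overwrites from the start)
    f.write(orig)             # rewrite the original contents after it
    f.write('\n1,1,1,1')      # new last row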
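
If matrix.txt ends with a newline, the version above leaves a blank line before the final row. A hedged variant that strips the trailing newline first, run on a scratch copy of the matrix.txt contents shown earlier:

```python
import os
import tempfile

# Work on a scratch copy of the matrix.txt contents shown earlier
path = os.path.join(tempfile.mkdtemp(), 'matrix.txt')
with open(path, 'w') as f:
    f.write('1,4,15,9\n0,11,7,3\n2,8,12,13\n14,5,10,6\n')

with open(path, 'r+') as f:
    orig = f.read().rstrip('\n')   # drop trailing newline(s) to avoid a blank row
    f.seek(0)                      # rewind; writes overwrite from here on
    f.write('0,0,0,0\n')
    f.write(orig)
    f.write('\n1,1,1,1\n')

with open(path) as f:
    rows = f.read().splitlines()

print(rows)
```

Reading everything first matters because write() overwrites from the cursor position; it does not push existing content forward.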

Quick intro to HDF5

HDF: Hierarchical data format

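Basic idea: it is better to store structured, numerical data in binary formats than in plain-text ASCII files. Why? Binary values have a fixed size and are usually smaller than their text representations, especially for floats and large integers.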

# small ints        # medium ints
42   (4 bytes)      123456   (4 bytes)
'42' (2 bytes)      '123456' (6 bytes)

# near-int floats   # e-notation floats
12.34   (8 bytes)   42.424242E+42   (8 bytes)
'12.34' (5 bytes)   '42.424242E+42' (13 bytes)

I/O is also faster, since binary files store values in the machine's native format and need no conversion to or from text.
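
The byte counts in the comparison above can be reproduced with the standard-library struct module. This sketch assumes 32-bit integers and 64-bit floats, matching the sizes listed:

```python
import struct

# Fixed-width binary encodings: the size does not depend on the value
int_small = struct.pack('<i', 42)            # 32-bit integer
int_medium = struct.pack('<i', 123456)       # 32-bit integer
float_a = struct.pack('<d', 12.34)           # 64-bit float
float_b = struct.pack('<d', 42.424242e42)    # 64-bit float

# Text encodings: one byte per ASCII character
sizes = {
    '42': (len(int_small), len('42')),
    '123456': (len(int_medium), len('123456')),
    '12.34': (len(float_a), len('12.34')),
    '42.424242E+42': (len(float_b), len('42.424242E+42')),
}

for text, (n_binary, n_text) in sizes.items():
    print(f'{text}: {n_binary} bytes binary, {n_text} bytes text')
```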

HDF5: Hierarchical Data Format, version 5

HDF5 files (.hdf5, .h5) store data in binary format.

HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.

HDF5 is a filesystem in a file: provides nested tree structure for datasets.

Using PyTables for HDF5

import tables as tb

PyTables provides five basic dataset classes:

Array: homogeneous components, fixed size; great for numerical data
CArray: chunked arrays
EArray: extendable arrays
VLArray: variable-length arrays
Table: structured collection of records whose values are stored in fixed-length fields

These datasets must be composed of atomic types:

bool: true or false type
int: signed integer types
uint: unsigned integer types
float: floating-point types
complex: complex floating-point types
string: fixed-length raw string type

Also: Groups, links, and hidden nodes

Getting started with HDF5 files

import tables as tb
h5file = tb.open_file('/path/to/file', 'a')
...

… or, better yet:

import tables as tb
with tb.open_file('/path/to/file', 'a') as h5file:
    ...

HDF5 file modes

'r': Read-only; no data can be modified.
'w': Write; create a new file (delete existing file with that name).
'a': Append; open existing file for reading or writing, or create new file.
'r+': Similar to 'a', but the file must already exist.

HDF5 basics

HDF5 files: all nodes stem from root node: / or h5file.root

Natural naming: can access subnodes as attributes of “parent” nodes, like h5file.root.a_group.some_data

(all relevant nodes in the tree must have names that are valid Python variable names)

Creating nodes, arrays, and tables

Create and access a group on the root node:

h5file.create_group(h5file.root, 'a_group', "My Group")
h5file.root.a_group

Datasets: arrays and tables, each with a corresponding create method:

Arrays are fixed size, must be created with data.
Tables have a set data type but are variable length, so we can append to them after creation.

Creating nodes, arrays, and tables

# integer array
h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])

# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
h5file.create_table(h5file.root, 'knights', dt)
h5file.root.knights.append(knights)

Hierarchy at this point:

/
|-- a_group/
|   |-- arthur_count
|
|-- knights

Arrays & tables preserve original flavor

h5file.root.a_group.arthur_count[:]
type(h5file.root.a_group.arthur_count[:])
type(h5file.root.a_group.arthur_count)

h5file.root.knights[1] # pull out second row

h5file.root.knights[:1] # slice the first row

mask = (h5file.root.knights.cols.id[:] < 28)
h5file.root.knights[mask] # create mask and apply to table

# Fancy index the second and first rows, in that order
h5file.root.knights[([1, 0],)]

Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically
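
A self-contained sketch of the flavor behavior, assuming PyTables and NumPy are installed. It rebuilds the nodes from the previous slides in a scratch file (the name knights.h5 is hypothetical):

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'knights.h5')

with tb.open_file(path, 'w') as h5file:
    # Rebuild the nodes from the earlier slides
    h5file.create_group(h5file.root, 'a_group', 'My Group')
    h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])
    dt = np.dtype([('id', int), ('name', 'S10')])
    knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
    h5file.create_table(h5file.root, 'knights', dt)
    h5file.root.knights.append(knights)
    h5file.root.knights.flush()

    counts = h5file.root.a_group.arthur_count[:]   # created from a list
    second = h5file.root.knights[1]                # one record
    mask = h5file.root.knights.cols.id[:] < 28     # boolean NumPy array
    below = h5file.root.knights[mask]              # structured NumPy array

print(type(counts).__name__, type(below).__name__, below['id'].tolist())
```

The array was created from a Python list, so slicing it gives back a list; the table was created from a NumPy dtype, so it gives back NumPy records.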

Can create rows and append to tables

# filename, sim, and time_end are assumed to be defined elsewhere
table_def = {'time': tb.Float64Col(pos=0),
             'temperature': tb.Float64Col(pos=1),
             'pressure': tb.Float64Col(pos=2),
             }
with tb.open_file(filename, mode='w', title="Table title"
                  ) as h5file:
    table = h5file.create_table(where=h5file.root,
                                name='simulation',
                                description=table_def
                                )
    
    # sim is an object representing a simulation
    
    timestep = table.row
    # Save initial conditions
    timestep['time'] = sim.time
    timestep['temperature'] = sim.temperature
    timestep['pressure'] = sim.pressure
    timestep.append()

    # Main time integration loop
    while sim.time < time_end:
        sim.step() # perform a single integration step

        # Save new timestep information
        timestep['time'] = sim.time
        timestep['temperature'] = sim.temperature
        timestep['pressure'] = sim.pressure

        # Add ``timestep`` to table
        timestep.append()
    table.flush()

Accessing this table

with tb.open_file(filename, 'r') as h5file:
    # Load the table node named 'simulation'
    table = h5file.root.simulation

    time = table.col('time')
    pressure = table.col('pressure')
    temperature = table.col('temperature')

These are NumPy arrays!
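
The write and read steps can be tried end to end with a stand-in simulation. In this sketch the ToySim class, its values, and the file name are all hypothetical and are not part of the original example:

```python
import os
import tempfile

import tables as tb


class ToySim:
    """Hypothetical stand-in for a simulation object."""

    def __init__(self):
        self.time = 0.0
        self.temperature = 300.0
        self.pressure = 101325.0

    def step(self, dt=0.25):
        # Advance time and warm up by one degree per step (made-up physics)
        self.time += dt
        self.temperature += 1.0


table_def = {'time': tb.Float64Col(pos=0),
             'temperature': tb.Float64Col(pos=1),
             'pressure': tb.Float64Col(pos=2),
             }
path = os.path.join(tempfile.mkdtemp(), 'sim.h5')
sim = ToySim()

with tb.open_file(path, mode='w', title='Toy simulation') as h5file:
    table = h5file.create_table(h5file.root, 'simulation', table_def)
    row = table.row
    for _ in range(5):
        row['time'] = sim.time
        row['temperature'] = sim.temperature
        row['pressure'] = sim.pressure
        row.append()        # queue this row for writing
        sim.step()
    table.flush()           # make sure queued rows reach the file

with tb.open_file(path, 'r') as h5file:
    table = h5file.root.simulation
    nrows = table.nrows
    time = table.col('time')
    temperature = table.col('temperature')

print(nrows, time, temperature)
```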

Hierarchy layout

# particles:  id, kind,       velocity
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ...
             ]

If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?

neutral = [(45, 'neutron',  0.39),
           (46, 'neutron',  0.72),
           (47, 'neutron',  0.55),
           (49, 'neutron',  0.23),
           ...
           ]

charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]

But now kind is redundant in the neutral table. Let’s delete this column and rely on the structure of the tables together to dictate which table refers to what.

neutral = [(45, 0.39),
           (46, 0.72),
           (47, 0.55),
           (49, 0.23),
           ...
           ]

charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]

We are embedding information directly into the semantics of the hierarchy.

We could add another layer that distinguishes the particles based on detector:

/
|-- detector1/
|   |-- neutral
|   |-- charged
|
|-- detector2/
|   |-- neutral
|   |-- charged

Breaking data up like this improves access times. It is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in each description (which decreases the size of rows).
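
A sketch of this layout in PyTables, assuming the sample particles listed above all came from detector1 (the original does not say which detector recorded them) and using a hypothetical file name:

```python
import os
import tempfile

import numpy as np
import tables as tb

# The neutral table drops the redundant 'kind' column
neutral_dt = np.dtype([('id', np.int64), ('velocity', np.float64)])
charged_dt = np.dtype([('id', np.int64), ('kind', 'S8'),
                       ('velocity', np.float64)])

neutral = np.array([(45, 0.39), (46, 0.72), (47, 0.55), (49, 0.23)],
                   dtype=neutral_dt)
charged = np.array([(42, 'electron', 72.0), (43, 'proton', 0.1),
                    (44, 'electron', 76.8), (48, 'proton', 0.18)],
                   dtype=charged_dt)

path = os.path.join(tempfile.mkdtemp(), 'particles.h5')  # hypothetical name

with tb.open_file(path, 'w') as h5file:
    # One group per detector, each holding a neutral and a charged table
    for name in ('detector1', 'detector2'):
        group = h5file.create_group(h5file.root, name)
        h5file.create_table(group, 'neutral', neutral_dt)
        h5file.create_table(group, 'charged', charged_dt)

    # Assumption: the sample rows belong to detector1
    h5file.root.detector1.neutral.append(neutral)
    h5file.root.detector1.charged.append(charged)
    h5file.root.detector1.neutral.flush()
    h5file.root.detector1.charged.flush()

    # Only detector1's neutral rows are searched, not every particle
    fast = h5file.root.detector1.neutral.read_where('velocity > 0.5')
    fast_ids = fast['id'].tolist()
    neutral_cols = h5file.root.detector1.neutral.colnames

print(fast_ids, neutral_cols)
```

The query touches one small table, which is the practical payoff of encoding the detector and charge in the hierarchy rather than in columns.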

Creating command-line interfaces for Python programs with argparse

Command-line interfaces

Rather than only calling your package from other Python code, you can (fairly easily) create a command-line interface.

myprogram --help

myprogram --input inputfile.txt --output output.h5

You can and should also use these to change the inputs or settings of your programs without modifying code.

Simple example

# contents of example.py
# import the necessary packages
import sys
import argparse

# construct the argument parser and parse the arguments
parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
    )
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )
args = parser.parse_args(sys.argv[1:])

# display a friendly message to the user
print("Hi there {}, it's nice to meet you!".format(args.name))

$ python example.py --help
$ python example.py --name Kyle
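
For quick experiments, parse_args also accepts an explicit list of strings in place of sys.argv[1:]. A small sketch using the same parser, showing what happens with and without the required argument:

```python
import argparse

parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
    )
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )

# Passing a list stands in for what the shell would supply in sys.argv[1:]
args = parser.parse_args(['--name', 'Kyle'])
message = "Hi there {}, it's nice to meet you!".format(args.name)
print(message)

# Short and long option names fill the same attribute
short = parser.parse_args(['-n', 'Kyle'])

# Omitting a required argument makes argparse print usage and exit
try:
    parser.parse_args([])
    exit_code = None
except SystemExit as err:
    exit_code = err.code

print(exit_code)
```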

Optional arguments

...
parser.add_argument('-c', '--count', action='store_true',
                    default=False,
                    help='Count number of characters in name'
                    )
args = parser.parse_args(sys.argv[1:])

print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))

$ python example.py --help
$ python example.py --name Kyle -c

Optional arguments

...
parser.add_argument('-a', '--age',
                    default=25,
                    type=int,
                    help='Age of person'
                    )
args = parser.parse_args(sys.argv[1:])

print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
print("Age: {}".format(args.age))

$ python example.py --help
$ python example.py --name Kyle --age 31
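
Putting the three arguments together, the sketch below shows the defaults that apply when the optional arguments are omitted and the conversion done by type=int. The build_parser helper is a hypothetical name used only to keep the example tidy:

```python
import argparse


def build_parser():
    """Assemble the parser from the slides (helper name is hypothetical)."""
    parser = argparse.ArgumentParser(
        description='This is a simple command-line program.'
        )
    parser.add_argument('-n', '--name', required=True,
                        help='name of the user')
    parser.add_argument('-c', '--count', action='store_true', default=False,
                        help='Count number of characters in name')
    parser.add_argument('-a', '--age', default=25, type=int,
                        help='Age of person')
    return parser


parser = build_parser()

# Only the required argument: optional ones fall back to their defaults
defaults = parser.parse_args(['--name', 'Kyle'])

# All arguments: '-c' flips the flag, '31' is converted to an int
full = parser.parse_args(['--name', 'Kyle', '-c', '--age', '31'])

print(defaults.count, defaults.age)
print(full.count, full.age, type(full.age).__name__)
```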

More complex input

For larger or more complex input, it may be better to use an input file.

YAML is a good option for this: use pyyaml (import yaml)

Example YAML input file

input.yaml:

case: square cavity
length: 1.0
resolution: 0.01
velocities:
  - 1.0
  - 10.0
  - 100.0

import yaml
with open('input.yaml', 'r') as f:
    inputs = yaml.safe_load(f)

yaml.safe_load returns a dictionary here, since the top level of the file is a mapping.
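
One possible way to combine the two ideas is to pass the YAML file name on the command line. This sketch assumes pyyaml is installed; the --input option is a hypothetical choice, and the example file is written to a temporary directory:

```python
import argparse
import os
import tempfile

import yaml

# Write the example input file to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'input.yaml')
with open(path, 'w') as f:
    f.write(
        'case: square cavity\n'
        'length: 1.0\n'
        'resolution: 0.01\n'
        'velocities:\n'
        '  - 1.0\n'
        '  - 10.0\n'
        '  - 100.0\n'
    )

# Hypothetical --input option naming the YAML file
parser = argparse.ArgumentParser(
    description='Run a case from a YAML input file.'
    )
parser.add_argument('-i', '--input', required=True,
                    help='path to YAML input file')
args = parser.parse_args(['--input', path])

with open(args.input, 'r') as f:
    inputs = yaml.safe_load(f)

print(type(inputs).__name__, inputs['case'], inputs['velocities'])
```

Scalars come back as native Python types (str and float here), and the velocities sequence becomes a list.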

Takehome messages

If you need to store numerical data, use HDF5 instead of plaintext.
Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.
