Files and Command-Line Arguments in Python

Files and Command-Line Inputs in Python

Software Development for Engineering Research

Kyle Niemeyer. 20 January 2022

ME 599, Corvallis, OR

Situations for working with files:

A collaborator emails you raw data, and you want to look at the results
You want to email a collaborator data from Python
You need to use external code that takes an input or data file (potentially 100s/1000s of times), and want to automate generation of the input files from data
An external program writes out one or more results files, and you want to read them and perform analysis
You want to keep an intermediate calculation for debugging or validation
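
As a rough illustration of the third situation, the sketch below writes one input file per parameter value. The template format, the file names, and the write_input_files helper are all hypothetical; a real external code would dictate its own input format.

```python
import tempfile
from pathlib import Path

# Hypothetical input-file template for an external solver
TEMPLATE = "velocity = {velocity}\nresolution = {resolution}\n"


def write_input_files(directory, velocities, resolution=0.01):
    """Write one input file per velocity and return the created paths."""
    directory = Path(directory)
    paths = []
    for i, velocity in enumerate(velocities):
        path = directory / f"case_{i:03d}.inp"
        with open(path, "w") as f:
            f.write(TEMPLATE.format(velocity=velocity, resolution=resolution))
        paths.append(path)
    return paths


# Use a temporary directory so the example cleans up after itself
with tempfile.TemporaryDirectory() as tmp:
    created = write_input_files(tmp, [1.0, 10.0, 100.0])
    names = [p.name for p in created]
    first_contents = created[0].read_text()

print(names)
print(first_contents)
```

The same loop scales to hundreds or thousands of cases, since only the list of values changes.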

Saving or loading data: file handle object


                            f = open('data.txt')

This does the following:

Checks that data.txt exists (raises FileNotFoundError if it does not)
Creates a new handle to this file
Sets the cursor position pos to the start of the file, pos = 0

This does not read into memory any of the file, write anything to the file, or close the file.

File handle methods

f.read(n=-1): Reads n characters (bytes in binary mode); if n=-1 or omitted, reads the entire rest of the file.
f.readline(): Reads the next full line, returns a string that includes the newline character.
f.readlines(): Reads the entire rest of the file, returns a list of strings (with newlines).
f.seek(pos): Moves the file cursor to the specified position.
f.tell(): Returns the current position in the file.
f.write(s): Writes string s at the current position, overwriting any existing content there.
f.flush(): Perform all pending write operations.
f.close(): Close file (no more reading or writing).
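
A small sketch of how the cursor moves as these methods are called. The file contents and the temporary file are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(path) as f:
    start = f.tell()              # cursor starts at the beginning of the file
    head = f.read(5)              # read the first five characters
    rest_of_line = f.readline()   # read from the cursor to the end of the line
    remaining = f.readlines()     # read all remaining lines into a list
    f.seek(0)                     # move the cursor back to the start
    again = f.readline()          # the first line can now be read again

os.remove(path)

print(repr(head), repr(rest_of_line), remaining, repr(again))
```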

Example: read in matrix

matrix.txt contents:


                            f = open('matrix.txt')
                            matrix = []
                            for line in f.readlines():
                                row = [int(x) for x in line.split(',')]
                                matrix.append(row)
                            f.close()

Make sure to close the file!

How to do this better?

Use a context manager (the with statement) to close the file automatically:


                            import numpy as np
                            with open('matrix.txt', 'r') as f:
                                lines = f.readlines()

                            matrix = np.array([[int(x) for x in line.split(',')]
                                              for line in lines]
                                              )

... actually, this example can be even easier:


                            import numpy as np
                            # note: genfromtxt returns floats by default; pass dtype=int for integers
                            matrix = np.genfromtxt('matrix.txt', delimiter=',')
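
To try the reading pattern without the original matrix.txt, a stand-in file can be generated first. The values below are invented and only assume the comma-separated integer layout that the examples above parse; this sketch uses plain lists rather than NumPy so it runs with the standard library alone.

```python
import os
import tempfile

# Hypothetical stand-in for matrix.txt: comma-separated integers, one row per line
sample = "1,4,15,9\n0,11,7,3\n2,8,12,13\n14,5,10,6\n"

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write(sample)

# Iterating over the handle yields one line at a time;
# the file is closed automatically when the with block ends
with open(path) as f:
    matrix = [[int(x) for x in line.split(",")] for line in f if line.strip()]

os.remove(path)

print(matrix)
```

Iterating over the handle directly avoids holding the full list of lines in memory, which can matter for large files.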

File modes

By default, open() opens a file in read-only mode. Other file modes:

'r': Read-only, no writing. Starts at pos = 0.
'w': Write. Creates file if it doesn't exist (if it does, contents are deleted). Starts at pos = 0.
'a': Append. Opens file for writing but does not delete contents; creates file if it doesn't exist. Starting pos is end of file.
'+': Update. Added to another mode (e.g., 'r+') to open for both reading and writing. 'r+' does not delete current contents and starts at pos = 0; 'w+' still deletes them.
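
The differences between these modes are easiest to see on a scratch file. This is a minimal sketch; the file name and contents are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes.txt")

    with open(path, "w") as f:      # 'w' creates the file
        f.write("abcdef\n")

    with open(path, "a") as f:      # 'a' starts at the end, keeps contents
        f.write("ghi\n")

    with open(path, "r+") as f:     # 'r+' starts at pos = 0, keeps contents
        f.write("XY")               # overwrites 'ab' rather than inserting

    with open(path) as f:
        after_update = f.read()

    with open(path, "w") as f:      # reopening with 'w' deletes the contents
        pass

    with open(path) as f:
        after_truncate = f.read()

print(repr(after_update))
print(repr(after_truncate))
```

Note that writing in 'r+' mode replaces existing characters, which is why prepending to a file requires reading the old contents first and writing them back afterwards.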

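More complicated example: read in a matrix and add rows


                            with open('matrix.txt', 'r+') as f:
                                orig = f.read()           # read the existing contents
                                f.seek(0)                 # move back to the start of the file
                                f.write('0,0,0,0\n')      # new first row (overwrites from pos = 0)
                                f.write(orig)             # rewrite the original contents after it
                                f.write('\n1,1,1,1')      # new last row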

Quick intro to HDF5

Basic idea: it is better to store structured, numerical data in binary formats than in plain-text ASCII files. Why? Smaller.


                            # small ints        # medium ints
                            42   (4 bytes)      123456   (4 bytes)
                            '42' (2 bytes)      '123456' (6 bytes)

                            # near-int floats   # e-notation floats
                            12.34   (8 bytes)   42.424242E+42   (8 bytes)
                            '12.34' (5 bytes)   '42.424242E+42' (13 bytes)

Also, I/O is faster, since binary files store data in its native format.
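
The size comparison can be checked with the standard-library struct module. This sketch assumes 4-byte integers and 8-byte floats, as in the table above; the sizes helper is introduced here only for illustration.

```python
import struct


def sizes(value, fmt):
    """Return (binary size, text size) in bytes for a single number."""
    binary = struct.pack(fmt, value)        # fixed-width binary encoding
    text = repr(value).encode("ascii")      # plain-text encoding
    return len(binary), len(text)


small_int = sizes(42, "<i")         # 4-byte signed integer
medium_int = sizes(123456, "<i")
near_int_float = sizes(12.34, "<d")  # 8-byte double-precision float

# A binary round trip recovers exactly the same float
round_trip = struct.unpack("<d", struct.pack("<d", 12.34))[0]

print(small_int, medium_int, near_int_float, round_trip)
```

Text can be smaller for short numbers, but binary sizes are fixed and predictable, and no parsing is needed when reading the values back.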

HDF5: Hierarchical Data Format

HDF5 files store data in binary format.

HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.

HDF5 is a filesystem in a file: provides nested tree structure for datasets.

Using PyTables for HDF5

import tables as tb

PyTables provides five basic dataset classes:

Array: files of the filesystem
CArray: chunked arrays
EArray: extendable arrays
VLArray: variable-length arrays
Table: structured arrays

Constructs need to be composed of atomic types:

bool: true or false type
int: signed integer types
uint: unsigned integer types
float: floating-point types
complex: complex floating-point types
string: fixed-length raw string type

Also: Groups, links, and hidden nodes

Getting started with HDF5 files


                            import tables as tb
                            h5file = tb.open_file('/path/to/file', 'a')
                            ...

... or, better yet:


                            import tables as tb
                            with tb.open_file('/path/to/file', 'a') as h5file:
                                ...

HDF5 file modes

'r': Read-only; no data can be modified.
'w': Write; create a new file (delete existing file with that name).
'a': Append; open existing file for reading or writing, or create new file.
'r+': Similar to a, but file must exist.

HDF5 files: all nodes stem from root node: / or h5file.root

Natural naming: can access subnodes as attributes of "parent" nodes, like h5file.root.a_group.some_data

(all relevant nodes in the tree must have names that are valid Python variable names)
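
A quick way to screen candidate node names before using them is to check whether each one is a valid Python identifier. The is_natural_name helper below is hypothetical and only approximates the rule stated above; PyTables applies its own checks.

```python
import keyword


def is_natural_name(name):
    """Rough check: valid Python identifier that is not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


candidates = ["a_group", "some_data", "2nd_run", "my-data", "class"]
results = {name: is_natural_name(name) for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'ok' if ok else 'not usable as an attribute'}")
```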

Creating nodes, arrays, and tables

Create and access a group on the root node:


                                h5file.create_group(h5file.root, 'a_group', "My Group")
                                h5file.root.a_group

Datasets: arrays and tables, each with a corresponding create method:

Arrays are fixed size, must be created with data.
Tables have a set data type but are variable length, so we can append to them after creation.

Creating nodes, arrays, and tables


                            import numpy as np

                            # integer array
                            h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])

                            # tables need descriptions (here, a NumPy structured dtype)
                            dt = np.dtype([('id', int), ('name', 'S10')])
                            knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
                            h5file.create_table(h5file.root, 'knights', dt)
                            h5file.root.knights.append(knights)

Hierarchy at this point:


                            /
                            |-- a_group/
                            |   |-- arthur_count
                            |
                            |-- knights

Arrays & tables preserve original flavor


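                            h5file.root.a_group.arthur_count[:]        # read the data back from disk
                            type(h5file.root.a_group.arthur_count[:])  # the data comes back as a list, as it went in
                            type(h5file.root.a_group.arthur_count)     # the node itself is a PyTables Array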


                            h5file.root.knights[1] # pull out second row

                            h5file.root.knights[:1] # slice the first row

                            mask = (h5file.root.knights.cols.id[:] < 28)
                            h5file.root.knights[mask] # create mask and apply to table

                            h5file.root.knights[([1, 0],)] # fancy index the second and first rows, in that order

Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically

Can create rows and append to tables


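                            import tables as tb

                            # column definitions; pos fixes the column order
                            table_def = {'time': tb.Float64Col(pos=0),
                                         'temperature': tb.Float64Col(pos=1),
                                         'pressure': tb.Float64Col(pos=2),
                                         }
                            # sim, filename, and time_end are assumed to be defined elsewhere
                            with tb.open_file(filename, mode='w', title="Table title"
                                              ) as h5file: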
                                table = h5file.create_table(where=h5file.root,
                                                            name='simulation',
                                                            description=table_def
                                                            )
                                timestep = table.row
                                # Save initial conditions
                                timestep['time'] = sim.time
                                timestep['temperature'] = sim.temperature
                                timestep['pressure'] = sim.pressure
                                timestep.append()

                                # Main time integration loop
                                while sim.time < time_end:
                                    sim.step() # perform a single integration step

                                    # Save new timestep information
                                    timestep['time'] = sim.time
                                    timestep['temperature'] = sim.temperature
                                    timestep['pressure'] = sim.pressure

                                    # Add ``timestep`` to table
                                    timestep.append()
                                table.flush()

Accessing this table


                            with tb.open_file(filename, 'r') as h5file:
                                # Load the table named 'simulation' from the root group
                                table = h5file.root.simulation

                                time = table.col('time')
                                pressure = table.col('pressure')
                                temperature = table.col('temperature')

These are NumPy arrays!

Hierarchy Layout


                            # particles:  id, kind,       velocity
                            particles = [(42, 'electron', 72.0),
                                         (43, 'proton', 0.1),
                                         (44, 'electron', 76.8),
                                         (45, 'neutron', 0.39),
                                         (46, 'neutron', 0.72),
                                         (47, 'neutron', 0.55),
                                         (48, 'proton', 0.18),
                                         (49, 'neutron', 0.23),
                                         ...
                                         ]

If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?


                            neutral = [(45, 'neutron',  0.39),
                                       (46, 'neutron',  0.72),
                                       (47, 'neutron',  0.55),
                                       (49, 'neutron',  0.23),
                                       ...
                                       ]

                            charged = [(42, 'electron', 72.0),
                                       (43, 'proton', 0.1),
                                       (44, 'electron', 76.8),
                                       (48, 'proton', 0.18),
                                       ...
                                       ]

But now kind is redundant in the neutral table. Let's delete this column and rely on the structure of the tables together to dictate which table refers to what.


                            neutral = [(45, 0.39),
                                       (46, 0.72),
                                       (47, 0.55),
                                       (49, 0.23),
                                       ...
                                       ]

                            charged = [(42, 'electron', 72.0),
                                       (43, 'proton', 0.1),
                                       (44, 'electron', 76.8),
                                       (48, 'proton', 0.18),
                                       ...
                                       ]

We are embedding information directly into the semantics of the hierarchy.

We could add another layer that distinguishes the particles based on detector:


                            /
                            |-- detector1/
                            |   |-- neutral
                            |   |-- charged
                            |
                            |-- detector2/
                            |   |-- neutral
                            |   |-- charged

Data should be broken up like this to improve access times. This is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in the description (which decreases the size of each row).
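
The same restructuring can be sketched with plain Python containers before committing to an HDF5 layout. Here a nested dict stands in for the group hierarchy; the NEUTRAL_KINDS set and the choice of detector1 are assumptions for illustration, and only the eight particles listed above are used.

```python
# particles: id, kind, velocity
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ]

NEUTRAL_KINDS = {'neutron'}  # assumed: the only neutral kind in this data

# Neutral rows drop the redundant kind column; charged rows keep it
neutral = [(pid, velocity) for pid, kind, velocity in particles
           if kind in NEUTRAL_KINDS]
charged = [p for p in particles if p[1] not in NEUTRAL_KINDS]

# Nested dict standing in for /detector1/neutral and /detector1/charged
hierarchy = {'detector1': {'neutral': neutral, 'charged': charged}}

# A query on neutral particles now touches only four short rows
fast_neutral = [pid for pid, velocity in hierarchy['detector1']['neutral']
                if velocity > 0.5]

print(fast_neutral)
```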

Creating command-line interfaces for Python programs with argparse

Rather than only calling your package from another Python code, you can (fairly easily) create a command-line interface.


                            myprogram --help

                            myprogram --input inputfile.txt --output output.h5

You can/should also use these to change inputs or settings to your programs without modifying code.

Simple example


                            # contents of example.py
                            # import the necessary packages
                            import sys
                            import argparse

                            # construct the argument parser and parse the arguments
                            parser = argparse.ArgumentParser(
                                description='This is a simple command-line program.'
                                )
                            parser.add_argument('-n', '--name', required=True,
                                                help='name of the user'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            # display a friendly message to the user
                            print("Hi there {}, it's nice to meet you!".format(args.name))


                            $ python example.py --help
                            $ python example.py --name Kyle

Optional arguments


                            ...
                            parser.add_argument('-c', '--count', action='store_true',
                                                default=False,
                                                help='Count number of characters in name'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            print("Hi there {}, it's nice to meet you!".format(args.name))
                            if args.count:
                                print("Name length: {}".format(len(args.name)))


                            $ python example.py --help
                            $ python example.py --name Kyle -c

Optional arguments


                            ...
                            parser.add_argument('-a', '--age',
                                                default=25,
                                                type=int,
                                                help='Age of person'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            print("Hi there {}, it's nice to meet you!".format(args.name))
                            if args.count:
                                print("Name length: {}".format(len(args.name)))
                            print("Age: {}".format(args.age))


                            $ python example.py --help
                            $ python example.py --name Kyle --age 31
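
Putting the three arguments together, and passing an explicit list to parse_args instead of reading sys.argv, makes the interface easy to try out and test. The build_parser function is a hypothetical wrapper around the same calls shown above.

```python
import argparse


def build_parser():
    """Assemble the parser from the examples above in one place."""
    parser = argparse.ArgumentParser(
        description='This is a simple command-line program.'
        )
    parser.add_argument('-n', '--name', required=True,
                        help='name of the user')
    parser.add_argument('-c', '--count', action='store_true',
                        help='Count number of characters in name')
    parser.add_argument('-a', '--age', default=25, type=int,
                        help='Age of person')
    return parser


parser = build_parser()

# Only the required argument given: the others take their defaults
defaults = parser.parse_args(['--name', 'Kyle'])

# Short and long options can be mixed; --age is converted to int
full = parser.parse_args(['-n', 'Kyle', '-c', '--age', '31'])

# Leaving out the required argument makes argparse print usage and exit
try:
    parser.parse_args([])
    missing_name_rejected = False
except SystemExit:
    missing_name_rejected = True

print(defaults, full, missing_name_rejected)
```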

More complex input

For larger/more complex input, it may be better to use an input file.

YAML is a good option for this: use PyYAML (import yaml)

input.yaml:


                            case: square cavity
                            length: 1.0
                            resolution: 0.01
                            velocities:
                              - 1.0
                              - 10.0
                              - 100.0


                            import yaml
                            with open('input.yaml', 'r') as f:
                                inputs = yaml.safe_load(f)

Returns a dict
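
Once loaded, the inputs behave like any other dictionary. The sketch below writes out by hand the dictionary that input.yaml above would produce, so it runs without PyYAML; the REQUIRED_KEYS check and the num_cells calculation are illustrative assumptions about how a program might use these values.

```python
# The dictionary corresponding to input.yaml above, written out by hand
inputs = {
    'case': 'square cavity',
    'length': 1.0,
    'resolution': 0.01,
    'velocities': [1.0, 10.0, 100.0],
}

# Fail early with a clear message if an expected key is absent
REQUIRED_KEYS = ('case', 'length', 'resolution', 'velocities')
missing = [key for key in REQUIRED_KEYS if key not in inputs]
if missing:
    raise KeyError(f"missing input keys: {missing}")

# Derived quantity: number of cells along the cavity length
num_cells = round(inputs['length'] / inputs['resolution'])

for velocity in inputs['velocities']:
    print(f"{inputs['case']}: velocity={velocity}, cells={num_cells}")
```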

Take-home messages

If you need to store numerical data, use HDF5 instead of plaintext.

Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.

More complicated topics: chunking, in-core and out-of-core operations, querying, compression