Files and Command-Line Inputs in Python

Software Development for Engineering Research


Kyle Niemeyer. 20 January 2022

ME 599, Corvallis, OR

Situations for working with files:


  • A collaborator emails you raw data, and you want to look at the results
  • You want to email a collaborator data from Python
  • You need to run external code that takes an input or data file (potentially 100s or 1000s of times) and want to automate generating the input files from data
  • An external program writes out one or more results files, and you want to read and perform analysis
  • You want to keep an intermediate calculation for debugging or validation

Saving or loading data: file handle object



                            f = open('data.txt')
                        

This does:

  1. Make sure data.txt exists
  2. Create new handle to this file
  3. Set cursor position pos to start of the file, pos = 0

This does not read into memory any of the file, write anything to the file, or close the file.

File handle methods


  • f.read(n=-1): Reads n characters (n bytes in binary mode); if n=-1 or omitted, reads the entire rest of the file.
  • f.readline(): Read next full line, return string with newline character.
  • f.readlines(): Reads entire rest of file, returns list of strings (with newlines).
  • f.seek(pos): Move file cursor to specified position.
  • f.tell(): Return current position in file.
  • f.write(s): Write string s at current position, overwriting any existing content there.
  • f.flush(): Perform all pending write operations.
  • f.close(): Close file (no more reading or writing).
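
The methods above can be tried out in a short session. This is a minimal sketch: the file name demo.txt and its contents are made up for illustration, and the file is written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Create a small text file to work with (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\nthird line\n')

f = open(path)          # opening reads nothing; the cursor starts at 0
start = f.tell()        # current position: 0
first = f.readline()    # 'first line\n' (the newline is kept)
rest = f.readlines()    # remaining lines, as a list of strings
f.seek(0)               # move the cursor back to the start
chunk = f.read(5)       # the first five characters: 'first'
f.close()               # no more reading or writing after this
```

Note that readlines() only returns what is left after the cursor, which is why the first line is missing from rest.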

Example: read in matrix


matrix.txt contents:


                            1,4,15,9
                            0,11,7,3
                            2,8,12,13
                            14,5,10,6
                        


                            f = open('matrix.txt')
                            matrix = []
                            for line in f.readlines():
                                row = [int(x) for x in line.split(',')]
                                matrix.append(row)
                            f.close()
                        

Make sure to close the file!

How to do this better?


Use context manager & with to automatically close:


                            import numpy as np
                            with open('matrix.txt', 'r') as f:
                                lines = f.readlines()

                            matrix = np.array([[int(x) for x in line.split(',')]
                                              for line in lines]
                                              )
                        

... actually, this example can be even easier:


                            import numpy as np
                            # dtype=int matches the integer matrix built above (the default is float)
                            matrix = np.genfromtxt('matrix.txt', delimiter=',', dtype=int)
                        

File modes


When opening a file, by default it opens in read-only mode. Other file modes:

  • 'r': Read-only, no writing. Starts at pos = 0.
  • 'w': Write. Creates file if it doesn't exist (if it does, contents are deleted). Starts at pos = 0.
  • 'a': Append. Opens file for writing but does not delete contents; creates file if it doesn't exist. Starting pos is end of file.
  • '+': Update. Added to another mode to open for both reading and writing: 'r+' does not delete current contents and starts at pos = 0, while 'w+' still deletes them.
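
A small sketch of how these modes behave in practice. The file name modes.txt and the strings written are illustrative only, and the file lives in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

# 'w' creates the file (or deletes the contents of an existing one).
with open(path, 'w') as f:
    f.write('abc\n')

# 'a' keeps the existing contents and writes at the end.
with open(path, 'a') as f:
    f.write('def\n')

# 'r+' reads and writes without deleting; writes overwrite in place.
with open(path, 'r+') as f:
    f.write('X')        # replaces only the first character

with open(path) as f:   # default mode is 'r'
    contents = f.read()

# Opening with 'w' again discards everything written so far.
with open(path, 'w') as f:
    f.write('new\n')
with open(path) as f:
    overwritten = f.read()
```

After the 'r+' step the file holds 'Xbc' and 'def' on separate lines; after the final 'w' step only 'new' remains.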

More complicated example: read in matrix and add rows



                            with open('matrix.txt', 'r+') as f:
                                orig = f.read()
                                f.seek(0)
                                f.write('0,0,0,0\n')
                                f.write(orig)
                                f.write('\n1,1,1,1')
                        

Quick intro to HDF5


Basic idea: it is better to store structured, numerical data in binary formats than in plain-text ASCII files. Why? The files are smaller.


                            # small ints        # medium ints
                            42   (4 bytes)      123456   (4 bytes)
                            '42' (2 bytes)      '123456' (6 bytes)

                            # near-int floats   # e-notation floats
                            12.34   (8 bytes)   42.424242E+42   (8 bytes)
                            '12.34' (5 bytes)   '42.424242E+42' (13 bytes)
                        

Also, faster I/O, since binary files save in native format.
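
The byte counts above can be checked with the standard-library struct module. This is only a sketch, assuming fixed-size little-endian encodings ('<i' for a 4-byte integer, '<d' for an 8-byte double); the dictionary and its keys are illustrative names.

```python
import struct

# (binary encoding, ASCII text encoding) for each value on the slide
encodings = {
    'small int': (struct.pack('<i', 42), '42'.encode('ascii')),
    'medium int': (struct.pack('<i', 123456), '123456'.encode('ascii')),
    'near-int float': (struct.pack('<d', 12.34), '12.34'.encode('ascii')),
    'e-notation float': (struct.pack('<d', 42.424242E+42),
                         '42.424242E+42'.encode('ascii')),
}

# Compare sizes in bytes: binary is fixed width, text grows with the digits.
sizes = {name: (len(binary), len(text))
         for name, (binary, text) in encodings.items()}
for name, (n_binary, n_text) in sizes.items():
    print('{:18s} binary: {} bytes, text: {} bytes'.format(
        name, n_binary, n_text))
```

Text wins only for very short numbers; for typical floating-point data the fixed-width binary form is smaller, and the text form would also need a separator between values.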

HDF5: Hierarchical Data Format


HDF5 files store data in binary format.

HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.

HDF5 is a filesystem in a file: provides nested tree structure for datasets.

Using PyTables for HDF5


import tables as tb

PyTables provides five basic dataset classes:

  • Array: fixed-size arrays (the "files" of the filesystem)
  • CArray: chunked arrays
  • EArray: extendable arrays
  • VLArray: variable-length arrays
  • Table: structured arrays

Constructs need to be composed of atomic types:

  • bool: true or false type
  • int: signed integer types
  • uint: unsigned integer types
  • float: floating-point types
  • complex: complex floating-point types
  • string: fixed-length raw string type

Also: Groups, links, and hidden nodes
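
To show how a dataset class and an atomic type fit together, here is a minimal sketch that builds an extendable array (EArray) of 64-bit floats. The file name atoms.h5, the node name temperatures, and the values are all hypothetical.

```python
import os
import tempfile

import tables as tb

path = os.path.join(tempfile.mkdtemp(), 'atoms.h5')

with tb.open_file(path, 'w') as h5file:
    # The atom fixes the element type; a 0 in the shape marks the
    # dimension along which the array can grow.
    temps = h5file.create_earray(h5file.root, 'temperatures',
                                 atom=tb.Float64Atom(), shape=(0,))
    temps.append([300.0, 305.5])
    temps.append([310.2])

    n_values = temps.nrows   # number of entries stored so far
    values = temps[:]        # read everything back as a NumPy array
```

Unlike a plain Array, the EArray starts empty and is filled by appending, which suits data whose final length is not known in advance.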

Getting started with HDF5 files



                            import tables as tb
                            h5file = tb.open_file('/path/to/file', 'a')
                            ...
                        

... or, better yet:


                            import tables as tb
                            with tb.open_file('/path/to/file', 'a') as h5file:
                                ...
                        

HDF5 file modes


  • 'r': Read-only; no data can be modified.
  • 'w': Write; create a new file (delete existing file with that name).
  • 'a': Append; open existing file for reading or writing, or create new file.
  • 'r+': Similar to a, but file must exist.

HDF5 files: all nodes stem from root node: / or h5file.root

Natural naming: can access subnodes as attributes of "parent" nodes, like h5file.root.a_group.some_data

(all relevant nodes in the tree must have names that are valid Python variable names)
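
A short sketch of natural naming next to the equivalent lookup by path string. The file name naming.h5 and the stored values are made up; a_group and some_data follow the names used above.

```python
import os
import tempfile

import tables as tb

path = os.path.join(tempfile.mkdtemp(), 'naming.h5')

with tb.open_file(path, 'w') as h5file:
    group = h5file.create_group(h5file.root, 'a_group', 'My Group')
    h5file.create_array(group, 'some_data', [1, 2, 3])

    # Natural naming: walk the tree with attribute access.
    by_attribute = h5file.root.a_group.some_data
    # The same node, looked up by its path string instead.
    by_path = h5file.get_node('/a_group/some_data')

    pathname = by_attribute._v_pathname
    by_path_name = by_path._v_pathname
    data = by_attribute.read()
```

Lookup by path is the fallback when a node name is not a valid Python identifier.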

Creating nodes, arrays, and tables


Create and access a group on the root node:


                                h5file.create_group(h5file.root, 'a_group', "My Group")
                                h5file.root.a_group
                            

Datasets: arrays and tables, each with a corresponding create method:

  • Arrays are fixed size, must be created with data.
  • Tables have a set data type but are variable length, so we can append to them after creation.

Creating nodes, arrays, and tables



                            import numpy as np

                            # integer array
                            h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])

                            # tables need descriptions
                            dt = np.dtype([('id', int), ('name', 'S10')])
                            knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
                            h5file.create_table(h5file.root, 'knights', dt)
                            h5file.root.knights.append(knights)
                        

Hierarchy at this point:


                            /
                            |-- a_group/
                            |   |-- arthur_count
                            |
                            |-- knights
                        

Arrays & tables preserve original flavor



                            h5file.root.a_group.arthur_count[:]
                            type(h5file.root.a_group.arthur_count[:])
                            type(h5file.root.a_group.arthur_count)
                        

                            h5file.root.knights[1] # pull out second row

                            h5file.root.knights[:1] # slice the first row

                            mask = (h5file.root.knights.cols.id[:] < 28)
                            h5file.root.knights[mask] # create mask and apply to table

                            h5file.root.knights[([1, 0],)] # fancy-index the second and first rows, in that order
                        

Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically

Can create rows and append to tables



                            import tables as tb

                            # filename, sim, and time_end are assumed to be defined elsewhere
                            table_def = {'time': tb.Float64Col(pos=0),
                                         'temperature': tb.Float64Col(pos=1),
                                         'pressure': tb.Float64Col(pos=2),
                                         }
                            with tb.open_file(filename, mode='w', title="Table title"
                                              ) as h5file:
                                table = h5file.create_table(where=h5file.root,
                                                            name='simulation',
                                                            description=table_def
                                                            )
                                timestep = table.row
                                # Save initial conditions
                                timestep['time'] = sim.time
                                timestep['temperature'] = sim.temperature
                                timestep['pressure'] = sim.pressure
                                timestep.append()

                                # Main time integration loop
                                while sim.time < time_end:
                                    sim.step() # perform a single integration step

                                    # Save new timestep information
                                    timestep['time'] = sim.time
                                    timestep['temperature'] = sim.temperature
                                    timestep['pressure'] = sim.pressure

                                    # Add ``timestep`` to table
                                    timestep.append()
                                table.flush()
                        

Accessing this table



                            with tb.open_file(filename, 'r') as h5file:
                                # Load the table named 'simulation' from the root group
                                table = h5file.root.simulation

                                time = table.col('time')
                                pressure = table.col('pressure')
                                temperature = table.col('temperature')
                        

These are NumPy arrays!
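
Because the columns come back as NumPy arrays, the usual array operations apply directly. The sketch below writes three made-up timesteps to a temporary file (sim.h5 is a hypothetical name), reads a column back, and also uses read_where, one way PyTables can return only the rows that match a condition.

```python
import os
import tempfile

import numpy as np
import tables as tb

path = os.path.join(tempfile.mkdtemp(), 'sim.h5')
table_def = {'time': tb.Float64Col(pos=0),
             'temperature': tb.Float64Col(pos=1),
             'pressure': tb.Float64Col(pos=2),
             }

# Write three made-up timesteps: (time, temperature, pressure).
with tb.open_file(path, 'w') as h5file:
    table = h5file.create_table(h5file.root, 'simulation', table_def)
    table.append([(0.0, 300.0, 101325.0),
                  (0.1, 350.0, 101400.0),
                  (0.2, 420.0, 101500.0)])
    table.flush()

with tb.open_file(path, 'r') as h5file:
    table = h5file.root.simulation
    temperature = table.col('temperature')
    is_numpy = isinstance(temperature, np.ndarray)
    max_temperature = temperature.max()

    # Query: only the rows satisfying the condition are returned.
    hot = table.read_where('temperature > 325.0')
    hot_times = hot['time'].tolist()
```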

Hierarchy Layout



                            # particles:  id, kind,       velocity
                            particles = [(42, 'electron', 72.0),
                                         (43, 'proton', 0.1),
                                         (44, 'electron', 76.8),
                                         (45, 'neutron', 0.39),
                                         (46, 'neutron', 0.72),
                                         (47, 'neutron', 0.55),
                                         (48, 'proton', 0.18),
                                         (49, 'neutron', 0.23),
                                         ...
                                         ]
                        

If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?


                            neutral = [(45, 'neutron',  0.39),
                                       (46, 'neutron',  0.72),
                                       (47, 'neutron',  0.55),
                                       (49, 'neutron',  0.23),
                                       ...
                                       ]

                            charged = [(42, 'electron', 72.0),
                                       (43, 'proton', 0.1),
                                       (44, 'electron', 76.8),
                                       (48, 'proton', 0.18),
                                       ...
                                       ]
                        

But now kind is redundant in the neutral table. Let's delete this column and rely on the structure of the tables together to dictate which table refers to what.


                            neutral = [(45, 0.39),
                                       (46, 0.72),
                                       (47, 0.55),
                                       (49, 0.23),
                                       ...
                                       ]

                            charged = [(42, 'electron', 72.0),
                                       (43, 'proton', 0.1),
                                       (44, 'electron', 76.8),
                                       (48, 'proton', 0.18),
                                       ...
                                       ]
                        

We are embedding information directly into the semantics of the hierarchy.

We could add another layer that distinguishes the particles based on detector:


                            /
                            |-- detector1/
                            |   |-- neutral
                            |   |-- charged
                            |
                            |-- detector2/
                            |   |-- neutral
                            |   |-- charged
                        

Data should be broken up like this to improve access speed. This is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in the description (which decreases the size of each row).
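
The detector layout above could be built with PyTables roughly as follows. This is a sketch under assumptions: the file name particles.h5, the column types, and which detector each particle belongs to are hypothetical, since the slides do not specify them.

```python
import os
import tempfile

import tables as tb

# Row layouts: the neutral table drops the redundant 'kind' column.
neutral_def = {'id': tb.Int64Col(pos=0),
               'velocity': tb.Float64Col(pos=1),
               }
charged_def = {'id': tb.Int64Col(pos=0),
               'kind': tb.StringCol(10, pos=1),
               'velocity': tb.Float64Col(pos=2),
               }

path = os.path.join(tempfile.mkdtemp(), 'particles.h5')
with tb.open_file(path, 'w') as h5file:
    # One group per detector, each holding a neutral and a charged table.
    for name in ('detector1', 'detector2'):
        group = h5file.create_group(h5file.root, name)
        h5file.create_table(group, 'neutral', neutral_def)
        h5file.create_table(group, 'charged', charged_def)

    # Store a few rows under detector1 (assignment is illustrative).
    h5file.root.detector1.neutral.append([(45, 0.39), (46, 0.72)])
    h5file.root.detector1.charged.append([(42, b'electron', 72.0),
                                          (43, b'proton', 0.1)])
    h5file.flush()

    # List every table in the file by its full path.
    table_paths = sorted(node._v_pathname for node in
                         h5file.walk_nodes('/', classname='Table'))
    n_neutral = h5file.root.detector1.neutral.nrows
```

Reading the neutral particles seen by one detector then touches a single small table instead of filtering one large table on two columns.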

More complicated topics: chunking, in-core and out-of-core operations, querying, compression

Creating command-line interfaces for Python programs with argparse


Rather than only calling your package from other Python code, you can (fairly easily) create a command-line interface.


                            myprogram --help

                            myprogram --input inputfile.txt --output output.h5
                        

You can/should also use these to change inputs or settings to your programs without modifying code.

Simple example


                            # contents of example.py
                            # import the necessary packages
                            import sys
                            import argparse

                            # construct the argument parser and parse the arguments
                            parser = argparse.ArgumentParser(
                                description='This is a simple command-line program.'
                                )
                            parser.add_argument('-n', '--name', required=True,
                                                help='name of the user'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            # display a friendly message to the user
                            print("Hi there {}, it's nice to meet you!".format(args.name))
                        

                            $ python example.py --help
                            $ python example.py --name Kyle
                        

Optional arguments



                            ...
                            parser.add_argument('-c', '--count', action='store_true',
                                                default=False,
                                                help='Count number of characters in name'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            print("Hi there {}, it's nice to meet you!".format(args.name))
                            if args.count:
                                print("Name length: {}".format(len(args.name)))
                        

                            $ python example.py --help
                            $ python example.py --name Kyle -c
                        

Optional arguments



                            ...
                            parser.add_argument('-a', '--age',
                                                default=25,
                                                type=int,
                                                help='Age of person'
                                                )
                            args = parser.parse_args(sys.argv[1:])

                            print("Hi there {}, it's nice to meet you!".format(args.name))
                            if args.count:
                                print("Name length: {}".format(len(args.name)))
                            print("Age: {}".format(args.age))
                        

                            $ python example.py --help
                            $ python example.py --name Kyle --age 31
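
Putting the three arguments together, the parser can also be exercised without a shell: parse_args accepts an explicit list of strings in place of sys.argv[1:]. The build_parser helper below is a hypothetical name introduced only for this sketch.

```python
import argparse


def build_parser():
    """Return a parser with the name, count, and age arguments."""
    parser = argparse.ArgumentParser(
        description='This is a simple command-line program.'
        )
    parser.add_argument('-n', '--name', required=True,
                        help='name of the user'
                        )
    parser.add_argument('-c', '--count', action='store_true',
                        default=False,
                        help='Count number of characters in name'
                        )
    parser.add_argument('-a', '--age', default=25, type=int,
                        help='Age of person'
                        )
    return parser


# Equivalent to: python example.py --name Kyle -c
args = build_parser().parse_args(['--name', 'Kyle', '-c'])
defaults_used = (args.name, args.count, args.age)

# Equivalent to: python example.py -n Kyle --age 31
args_with_age = build_parser().parse_args(['-n', 'Kyle', '--age', '31'])
```

Because type=int is set, the string '31' arrives as an integer, and leaving out --age falls back to the default of 25.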
                        

More complex input


For larger or more complex input, it may be better to use an input file.

YAML is a good option for this: use pyyaml (import yaml)

input.yaml:


                            case: square cavity
                            length: 1.0
                            resolution: 0.01
                            velocities:
                              - 1.0
                              - 10.0
                              - 100.0
                        

                            import yaml
                            with open('input.yaml', 'r') as f:
                                inputs = yaml.safe_load(f)
                        

Returns a dict
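
A brief sketch of what that dict looks like for the input above. Here the YAML text is passed as a string so the example is self-contained (safe_load accepts either a string or an open file), and n_cells is a hypothetical derived quantity, not something from the original input.

```python
import yaml

text = """\
case: square cavity
length: 1.0
resolution: 0.01
velocities:
  - 1.0
  - 10.0
  - 100.0
"""
inputs = yaml.safe_load(text)

# Values arrive already typed: a str, two floats, and a list of floats.
case = inputs['case']
n_cells = round(inputs['length'] / inputs['resolution'])
fastest = max(inputs['velocities'])
```

No manual string-to-number conversion is needed, which is a large part of the appeal over a hand-rolled text format.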

Takehome messages


If you need to store numerical data, use HDF5 instead of plaintext.

Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.