f = open('data.txt')
This does two things: it checks that data.txt exists, and it sets the cursor position pos to the start of the file (pos = 0).
This does not read into memory any of the file, write anything to the file, or close the file.
f.read(n=-1)
: Read n characters (n bytes in binary mode); if n=-1 or not present, read entire rest of file.
f.readline()
: Read next full line, return string with newline character.
f.readlines()
: Reads entire rest of file, returns list of strings (with newlines).
f.seek(pos)
: Move file cursor to specified position.
f.tell()
: Return current position in file.
f.write(s)
: Write string s at current position, overwriting any existing content there.
f.flush()
: Perform all pending write operations.
f.close()
: Close file (no more reading or writing).
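The methods above can be tried together on a small scratch file. The following is a minimal sketch; the temporary file and its three lines of text are made up for illustration.

```python
import os
import tempfile

# Create a scratch file (hypothetical contents, for illustration only)
path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\nthird line\n')

f = open(path)
start = f.tell()        # position right after opening
first = f.readline()    # reads up to and including the newline
rest = f.readlines()    # remaining lines as a list of strings
f.seek(0)               # move the cursor back to the start
head = f.read(5)        # read the first five characters
f.close()               # no more reading after this
```

Note that readlines() only returns what is left after the cursor, which is why the first line is missing from rest.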
matrix.txt
contents:
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6
f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()
Make sure to close the file!
Use a context manager (the with statement) to close the file automatically:
import numpy as np
with open('matrix.txt', 'r') as f:
    lines = f.readlines()
matrix = np.array([[int(x) for x in line.split(',')]
                   for line in lines]
                  )
... actually, this example can be even easier:
import numpy as np
matrix = np.genfromtxt('matrix.txt', delimiter=',')
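One difference from the explicit int() loop is worth checking: np.genfromtxt produces floating-point values unless a dtype is given. The sketch below holds the matrix.txt contents in a string (an assumption made so it runs without the file) and compares the two.

```python
import io

import numpy as np

# Same contents as matrix.txt above, held in memory for this sketch
text = "1,4,15,9\n0,11,7,3\n2,8,12,13\n14,5,10,6\n"

# Default behaviour: values come back as floats
as_float = np.genfromtxt(io.StringIO(text), delimiter=',')

# Asking for integers matches the int(x) conversion used earlier
as_int = np.genfromtxt(io.StringIO(text), delimiter=',', dtype=int)
```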
When opening a file, by default it opens in read-only mode. Other file modes:
'r'
: Read-only, no writing. Starts at pos = 0.
'w'
: Write. Creates file if it doesn't exist (if it does, contents are deleted). Starts at pos = 0.
'a'
: Append. Opens file for writing but does not delete contents; creates file if it doesn't exist. Starting pos is end of file.
'+'
: Update. Added to another mode to open for both reading and writing. 'r+' does not delete current contents and starts at pos = 0; 'w+' still deletes them.
with open('matrix.txt', 'r+') as f:
    orig = f.read()
    f.seek(0)
    f.write('0,0,0,0\n')
    f.write(orig)
    f.write('\n1,1,1,1')
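The example above reads the original contents first and writes them back because a write in 'r+' mode overwrites what is already at the cursor rather than pushing it aside. A minimal sketch of how the modes differ, using a hypothetical scratch file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')  # hypothetical scratch file

with open(path, 'w') as f:      # 'w' creates the file (or empties it)
    f.write('abcdef\n')

with open(path, 'a') as f:      # 'a' keeps contents; writes go to the end
    f.write('ghi\n')

with open(path, 'r+') as f:     # 'r+' keeps contents; cursor starts at 0
    f.write('XY')               # replaces 'ab'; nothing is shifted

with open(path) as f:           # default mode is 'r'
    contents = f.read()
```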
Basic idea: better to store structured, numerical data in binary formats over plain-text ASCII files. Why? Smaller.
# small ints             # medium ints
42       (4 bytes)       123456    (4 bytes)
'42'     (2 bytes)       '123456'  (6 bytes)
# near-int floats        # e-notation floats
12.34    (8 bytes)       42.424242E+42    (8 bytes)
'12.34'  (5 bytes)       '42.424242E+42'  (13 bytes)
Also, faster I/O, since binary files save in native format.
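The byte counts in the table can be checked with the standard struct module. This is only a sketch of the size comparison, assuming a 4-byte integer and an 8-byte double as in the table; it is not how HDF5 itself lays out data.

```python
import struct

value_int = 123456
value_float = 42.424242e42

int_binary = struct.pack('<i', value_int)      # 4-byte signed integer
int_text = str(value_int).encode('ascii')      # one byte per digit

float_binary = struct.pack('<d', value_float)  # 8-byte double
float_text = b'42.424242E+42'                  # text form from the table above

sizes = {
    'int binary': len(int_binary),
    'int text': len(int_text),
    'float binary': len(float_binary),
    'float text': len(float_text),
}

# The binary form converts back to the same value without any text parsing
restored = struct.unpack('<d', float_binary)[0]
```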
HDF5 files store data in binary format.
HDF5 provides database features like storing many datasets, user-defined metadata, optimized I/O, and ability to query contents.
HDF5 is a filesystem in a file: provides nested tree structure for datasets.
import tables as tb
PyTables provides five basic dataset classes:
Array
: files of the filesystem
CArray
: chunked arrays
EArray
: extendable arrays
VLArray
: variable-length arrays
Table
: structured arrays
Constructs need to be composed of atomic types:
bool
: true or false type
int
: signed integer types
uint
: unsigned integer types
float
: floating-point types
complex
: complex floating-point types
string
: fixed-length raw string type
Also: Groups, links, and hidden nodes
import tables as tb
h5file = tb.open_file('/path/to/file', 'a')
...
... or, better yet:
import tables as tb
with tb.open_file('/path/to/file', 'a') as h5file:
    ...
'r'
: Read-only; no data can be modified.
'w'
: Write; create a new file (delete existing file with that name).
'a'
: Append; open existing file for reading or writing, or create new file.
'r+'
: Similar to 'a', but the file must exist.
HDF5 files: all nodes stem from root node: /
or h5file.root
Natural naming: can access subnodes as attributes of "parent" nodes, like h5file.root.a_group.some_data
(all relevant nodes in the tree must have names that are valid Python variable names)
Create and access a group on the root node:
h5file.create_group(h5file.root, 'a_group', "My Group")
h5file.root.a_group
Datasets: arrays and tables, each with a corresponding create method (create_array, create_table):
# integer array
h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])
# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
h5file.create_table(h5file.root, 'knights', dt)
h5file.root.knights.append(knights)
Hierarchy at this point:
/
|-- a_group/
|   |-- arthur_count
|
|-- knights
h5file.root.a_group.arthur_count[:]
type(h5file.root.a_group.arthur_count[:])
type(h5file.root.a_group.arthur_count)
h5file.root.knights[1] # pull out second row
h5file.root.knights[:1] # slice the first row
mask = (h5file.root.knights.cols.id[:] < 28)
h5file.root.knights[mask] # create mask and apply to table
h5file.root.knights[([1, 0],)] #Fancy index the second and first rows, in that order
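The group, array, and table snippets above can be combined into one script that runs end to end. This is a sketch under a few assumptions: PyTables and NumPy are installed, and the file is written to a hypothetical temporary path instead of the elided '/path/to/file'.

```python
import os
import tempfile

import numpy as np
import tables as tb

path = os.path.join(tempfile.mkdtemp(), 'knights.h5')  # hypothetical scratch file

dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)

# Build the hierarchy: /a_group/arthur_count and /knights
with tb.open_file(path, 'w') as h5file:
    h5file.create_group(h5file.root, 'a_group', "My Group")
    h5file.create_array(h5file.root.a_group, 'arthur_count', [1, 2, 5, 3])
    table = h5file.create_table(h5file.root, 'knights', dt)
    table.append(knights)
    table.flush()

# Reopen read-only and index the datasets
with tb.open_file(path, 'r') as h5file:
    counts = h5file.root.a_group.arthur_count[:]   # read the whole array
    second = h5file.root.knights[1]                # second row
    mask = h5file.root.knights.cols.id[:] < 28     # boolean mask on the id column
    selected = h5file.root.knights[mask]           # rows where id < 28
```

Everything read inside the second with block is copied into memory, so counts, second, and selected remain usable after the file is closed.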
Memory mapping: pull in data from disk only as needed; HDF5 takes care of this automatically
import tables

table_def = {'time': tables.Float64Col(pos=0),
             'temperature': tables.Float64Col(pos=1),
             'pressure': tables.Float64Col(pos=2),
             }
# filename, sim, and time_end are assumed to be defined elsewhere
with tables.open_file(filename, mode='w', title="Table title"
                      ) as h5file:
    table = h5file.create_table(where=h5file.root,
                                name='simulation',
                                description=table_def
                                )
    timestep = table.row
    # Save initial conditions
    timestep['time'] = sim.time
    timestep['temperature'] = sim.temperature
    timestep['pressure'] = sim.pressure
    timestep.append()
    # Main time integration loop
    while sim.time < time_end:
        sim.step()  # perform a single integration step
        # Save new timestep information
        timestep['time'] = sim.time
        timestep['temperature'] = sim.temperature
        timestep['pressure'] = sim.pressure
        # Add ``timestep`` to table
        timestep.append()
    # Write any buffered rows to disk
    table.flush()
with tables.open_file(filename, 'r') as h5file:
    # Load the table named simulation from the root group
    table = h5file.root.simulation
    time = table.col('time')
    pressure = table.col('pressure')
    temperature = table.col('temperature')
These are NumPy arrays!
# particles: id, kind, velocity
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ...
             ]
If we know we want to look at neutral and charged particles separately, then why search through all the particles all the time?
neutral = [(45, 'neutron', 0.39),
           (46, 'neutron', 0.72),
           (47, 'neutron', 0.55),
           (49, 'neutron', 0.23),
           ...
           ]
charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]
But now kind is redundant in the neutral table.
Let's delete this column and rely on the structure of the tables together to dictate which table refers to what.
neutral = [(45, 0.39),
           (46, 0.72),
           (47, 0.55),
           (49, 0.23),
           ...
           ]
charged = [(42, 'electron', 72.0),
           (43, 'proton', 0.1),
           (44, 'electron', 76.8),
           (48, 'proton', 0.18),
           ...
           ]
We are embedding information directly into the semantics of the hierarchy.
We could add another layer that distinguishes the particles based on detector:
/
|-- detector1/
|   |-- neutral
|   |-- charged
|
|-- detector2/
|   |-- neutral
|   |-- charged
Data should be broken up like this to improve access times. This is more efficient because there are fewer rows to search, fewer rows to pull from disk, and fewer columns in the description (which decreases the size of each row).
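The split described above can be sketched in plain Python before committing to a file layout. The particle rows are the ones listed earlier (the elided rows are left out), and the hierarchy dict is a hypothetical stand-in for the detector tree, not an HDF5 file.

```python
# particles: id, kind, velocity (the rows shown earlier, without the elided ones)
particles = [(42, 'electron', 72.0),
             (43, 'proton', 0.1),
             (44, 'electron', 76.8),
             (45, 'neutron', 0.39),
             (46, 'neutron', 0.72),
             (47, 'neutron', 0.55),
             (48, 'proton', 0.18),
             (49, 'neutron', 0.23),
             ]

# Neutral rows drop the redundant kind column; charged rows keep it
neutral = [(pid, velocity) for pid, kind, velocity in particles
           if kind == 'neutron']
charged = [row for row in particles if row[1] != 'neutron']

# Hypothetical nested layout mirroring one branch of the detector tree
hierarchy = {'detector1': {'neutral': neutral, 'charged': charged}}
```

A query that only concerns neutral particles now touches four two-column rows instead of eight three-column rows.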
Rather than only calling your package from other Python code, you can (fairly easily) create a command-line interface.
myprogram --help
myprogram --input inputfile.txt --output output.h5
You can/should also use these to change inputs or settings to your programs without modifying code.
# contents of example.py
# import the necessary packages
import sys
import argparse
# construct the argument parser and parse the arguments
parser = argparse.ArgumentParser(
    description='This is a simple command-line program.'
    )
parser.add_argument('-n', '--name', required=True,
                    help='name of the user'
                    )
args = parser.parse_args(sys.argv[1:])
# display a friendly message to the user
print("Hi there {}, it's nice to meet you!".format(args.name))
$ python example.py --help
$ python example.py --name Kyle
...
parser.add_argument('-c', '--count', action='store_true',
                    default=False,
                    help='Count number of characters in name'
                    )
args = parser.parse_args(sys.argv[1:])
print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
$ python example.py --help
$ python example.py --name Kyle -c
...
parser.add_argument('-a', '--age',
                    default=25,
                    type=int,
                    help='Age of person'
                    )
args = parser.parse_args(sys.argv[1:])
print("Hi there {}, it's nice to meet you!".format(args.name))
if args.count:
    print("Name length: {}".format(len(args.name)))
print("Age: {}".format(args.age))
$ python example.py --help
$ python example.py --name Kyle --age 31
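Because parse_args accepts any list of strings, the parser can be exercised without a shell. The sketch below gathers the three arguments into one function; build_parser is a hypothetical helper name introduced here, not part of example.py.

```python
import argparse


def build_parser():
    """Assemble the parser from the three snippets above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description='This is a simple command-line program.'
    )
    parser.add_argument('-n', '--name', required=True,
                        help='name of the user')
    parser.add_argument('-c', '--count', action='store_true',
                        default=False,
                        help='Count number of characters in name')
    parser.add_argument('-a', '--age', default=25, type=int,
                        help='Age of person')
    return parser


# A list of strings stands in for what the shell supplies via sys.argv[1:]
args = build_parser().parse_args(['--name', 'Kyle', '-c', '--age', '31'])

# Omitted options fall back to their defaults
defaults = build_parser().parse_args(['-n', 'Kyle'])
```

Keeping parser construction in a function like this also makes it straightforward to test option handling separately from the rest of the program.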
For larger or more complex input, it may be better to use an input file.
YAML is a good option for this: use pyyaml (import yaml).
input.yaml:
case: square cavity
length: 1.0
resolution: 0.01
velocities:
- 1.0
- 10.0
- 100.0
import yaml
with open('input.yaml', 'r') as f:
    inputs = yaml.safe_load(f)
Returns a dict
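To see what that dict looks like, the same YAML can be loaded from a string. This sketch assumes pyyaml is installed; the text variable simply repeats the input.yaml contents so no file is needed.

```python
import yaml

# Same contents as input.yaml above, held in a string for this sketch
text = """
case: square cavity
length: 1.0
resolution: 0.01
velocities:
- 1.0
- 10.0
- 100.0
"""

inputs = yaml.safe_load(text)

case = inputs['case']              # plain string
velocities = inputs['velocities']  # YAML sequence becomes a Python list
length = inputs['length']          # numeric scalars are converted for you
```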
If you need to store numerical data, use HDF5 instead of plaintext.
Consider implementing a command-line interface for your program, possibly in combination with more complex YAML configuration/input files.