Welcome
- These notes are part of a course on software development skills for scientists and engineers being prepared by Greg Wilson for the Python Software Foundation
- Please use the links included with the slides to provide comments and feedback
- Comment on this slide
Course Outline
Acknowledgments
- The
Python Software Foundation, for the grant that made this work possible - The
University of Toronto, for letting me test this version of this course on its students - Its
Department of Computer Science, for giving me a home, and making me feel welcome - Brent Gorda, who helped create the first version of this course
- Heather Mayer, for the artwork
YesLogic, for generously donating licenses for Prince (their XML-to-PDF converter)- The many people who commented on this material, and suggested ways to improve it:
| Hossein Bidhendi | Stephane Bortzmeyer | Michelle Craig | Simon Duane | Paul Dubois | Hans Fangohr | Brent Gorda |
| Adam Goucher | Perry Greenfield | Paul Gries | Brandon King | Catherine Letondal | Michelle Levesque | Andy Lumsdaine |
| Laurie MacDougall | Keir Mierle | Kit-Sun Ng | Dirkjan Ochtman | Victor Putz | Irving Reid | Karen Reid |
| Paul Salvini | Diomidis Spinellis | Bill Spotz | Tom Van Vleck | Jim Vickroy | | |
- Frank Willison—I'm sorry this one was finished too late for you to tune up
- John Scopes, and everyone else with the courage to fight for the idea that the truth is more important than doctrine
- Comment on this slide
Introduction
Motivation
- Computers are as important to scientists as telescopes and test tubes
- Analyze problems that are too complex for traditional means
- Simulate things that can't be studied in laboratories
- Many scientists now spend much of their professional lives writing and maintaining software
- A quarter of graduate students in science and engineering spend 25-50% of their time programming
- But most scientists have never been taught how to do this efficiently
- It's a long way from the loops and arrays of first year to simulating bone development in foetal marsupials…
- Like being shown how to differentiate polynomials, then expected to invent the rest of calculus
- This course will teach you how to design, build, maintain, and share programs more efficiently
- Focus: tools and techniques appropriate for half a dozen people working together for a year
- Everything you do at that scale will also make you more productive when you're working on your own for a week
- Will not turn you into a computer scientist
- Far too many of them around anyway
- Instead, goal is to teach you the equivalent of good laboratory technique for computational science
- The 20% of ideas that account for 80% of real world use
- Software carpentry, rather than software engineering
- Comment on this slide
Meeting Standards
- Experimental results are only publishable if they are believed to be correct and reproducible
- Equipment calibrated, samples uncontaminated, relevant steps recorded
- In practice, almost always rely on the professionalism of the people doing the work
- How well do computational scientists meet these standards?
- Correctness of code rarely questioned
- We all know programs are buggy, but when was the last time you saw a paper rejected because of concerns over the quality of the software used to produce the results?
- Reproducibility often nonexistent
- How many people can reproduce, much less trace, each computational result in their thesis?
- The times they are a-changing
- Standards for computational work can only go up
- Change can happen almost overnight
- Like the American car market when German and Japanese imports appeared in the 1970s
- Comment on this slide
The Most Important Idea in This Course
- Improving quality improves productivity
- I.e., the tools and techniques you must adopt to produce better code also help you write more code, faster
- Comment on this slide
Who You Are
- User stories
- Important part of designing user interfaces for mass-distribution software
- Helps make discussion of features and usability more concrete
- Harry
- 27; B.Sc. in zoology
- Did an introductory Fortran course nine years ago, and attended a workshop on web-based bioinformatics tools when he started his job
- Now developing fuzzy pattern-matching algorithms for Genes'R'Us, a biotech firm with labs in four countries
- Ron
- 24; B.Eng in mechanical engineering, now doing an M.Sc. part time
- Did C in first year; has been using MATLAB ever since
- Modeling thermal degradation (a.k.a. “melting”) of firefighters' helmets
- Hermione
- 34; Ph.D. in physics
- Took two courses on C and two on numerical analysis as an undergrad, and a computer graphics course as a graduate student
- Now in charge of the 5-person flywheel braking group at Yoyodyne Inc.
- Ginny
- 22; finished a B.Eng. in chemical engineering last year, now doing an M.Sc. in biochemistry
- Did C in first year, and has built a personal web site (static HTML only)
- Thesis topic is improving the yield of organic fullerene production
- Albus
- 47; Ph.D. in mathematics; studies large random graphs using both analytic and computational techniques
- Undergraduate degree in Mathematics and Computer Science in the 1970s; programs mainly in C++
- Professor at Euphoric State University; former chair of graduate studies
- Comment on this slide
A Quick Self-Test
- Adapted from
[Spolsky 2004]
and
[McConnell 1997]
- 1 for “yes”, 0 for “no”, and -1 if you don't understand the question
- So:
- Do you use version control?
- Can you rebuild everything in one step?
- Do you have an automated test suite?
- Do you build the software, and run the test suite, daily?
- Do you have a bug database?
- Do you use a symbolic debugger?
- Do you use a style checker to ensure that your software is written in a uniform, readable way?
- Can you trace everything you release back to the software that produced it?
- Do you document as you program, and keep your documentation in your source files?
- Can you set up a development environment (including any libraries you need) on a fresh machine without heroic effort?
- Do you have a schedule with small binary milestones?
- Do you estimate how long tasks will take before you start, and compare that with how long they actually took?
- If you're working with others, or developing software that other people may use:
- Do you test your interfaces using paper prototypes before implementing them?
- Is there a searchable archive of discussions about the project?
- Do team members write and share small tools for automating common tasks?
- Does your schedule allow for infrastructure development, training, sick time, etc.?
- And your score is?
- Comment on this slide
Learn by Building
- So why are we where we are?
- It's difficult to learn these things from academic computer scientists
- CS research is more concerned with rapid prototyping than with reliability
- People are naturally sceptical of innovation
- Particularly after they've seen a few bandwagons roll through
- Glass's Law
[Glass 2002]
: any new way of doing things initially slows you down
- You only have to be as good as the competition
- American auto makers in the 1970s
- This course's approach:
- Introduce some basic tools
- Students immediately see benefit of taking the course
- Tools can be used to manage the course itself
- Show students how to build tools like these
- Where “how” includes both what goes into the software, and how to create it
- Solidifies understanding of tools' capabilities and limitations
- Makes discussion of technique more concrete
- Show students what else they can do with their new skills
- The right way to tackle issues that come up over and over again
- Key point: avoid overload
- People who already know these things tend to underestimate how hard they are to learn
- No point preaching to the converted
- Try instead to move the middle of the bell curve to the right
- Why is the course called software “carpentry”?
- Because it focuses on the details of the craft
- If software engineering is about building an electronic version of the Channel Tunnel, this stuff is the equivalent of putting an extension on the house
- Comment on this slide
Topics
- Shell programming
- Version control
- Automating builds
- Python (4 lectures)
- Systematic debugging
- Higher-level programming (2 lectures)
- Testing (2 lectures)
- Coding style and reading code
- Data crunching (4 lectures)
- Web programming and security (3 lectures)
- Software development project tools and processes (4 lectures)
- Comment on this slide
Setting Up
- Some previous programming experience
for loops, if/then/else- Function calls
- Arrays
- File I/O
- Compilation
- Individual setup
- Python (version 2.4 or higher)
- Cygwin (on Windows)
- An editor
- We'll look at smart ones later in the course
- Subversion
- We'll spend a lecture on this next week
- Time
- Expect to spend 2-3 hours outside class for each lecture
- Comment on this slide
Recommended Reading
- “If you only have time to read one book, make time to read two”
-
[Glass 2002]
summarizes what we actually know about programmers' productivity
-
[Hunt & Thomas 1999]
and
[Gunderloy 2004]
are about the things that distinguish good programmers from bad ones
-
[Lutz & Ascher 2003]
is the standard introduction to Python
-
[Langtangen 2004]
is a comprehensive introducton to Python aimed squarely at scientists and engineers
- Goes into much more detail than this course will
- But doesn't address broader issues, such as programming practices
- See the Bibliography for others
- Check out some of the Online Resources as well
- Comment on this slide
Typographic Conventions
Version Control
Problem #1: Synchronizing Files
- Want to work on one set of files on three different machines
- Option 1: use a shared file system
- Difficult to set up
- And even more difficult to make secure
- Inflexible: what if you're on the road?
- Option 2: Carry around a floppy or USB dongle
- Have to remember to copy files onto it and then off again
- What if your project is too big?
- Option 3: mail, FTP, SCP, etc.
- Still have to remember to push and pull exactly the right files at exactly the right time
- Option 4: get the computer to do the work
- Keep a master copy in one place
- Use a program to synchronize working copies when and as needed
- Comment on this slide
Problem #2: Undoing Changes
- Often want to undo changes to a file
- Start work, realize it's the wrong approach, want to get back to starting point
- Like “undo” in an editor…
- …but longer-lived—keep the whole history of every file, forever
- Similarly, often want to see who changed what, and when
- When working in teams, want to see what your partners did
- Bugs are more likely to be in fresh code than in code that's been running for a long time
- Comment on this slide
Solution: Version Control
- Solve both problems at once by using a version control system (VCS)
- Mechanics:
- Keep the master copy of every file in a central repository
- Actually, keep all old versions of every file
- Everyone on the development team does their work in a working copy
- When you're ready to share your changes, you commit them to the repository
- The VCS saves the old version of the file, then writes your changes on top
- Also records the time of the change, who made it, and a comment
- Take a look in a moment at what happens if two or more people try to make changes at the same time
- Comment on this slide
CVS and Subversion
- Two open source version control systems in widespread use
- Many others available commercially
- If you can afford it, use Perforce
- CVS (Concurrent Version System)
- Invented in the 1980s
- Very popular, but showing its age
- Flaw #1: it keeps track of each file separately
- But authors often change several files in tandem
- Since CVS has no notion of a “batch submit”, there's no reliable way to say, “What other changes were made in conjunction with this one?”
- Flaw #2: you can create new directories, but can't ever delete old ones
- Subversion
- Designed as a backward-compatible replacement for CVS starting in 2000
- Fixes both of the major flaws in CVS
- Many open source projects have switched, or are switching…
- …so we'll use it in this course
- See the
Subversion site for details
- Comment on this slide
Basic Use
- Assume for a moment that a repository has been created, and that Ron and Hermione already have working copies
- Ron wants to make changes to the rotor spin simulation
- Runs
svn update to bring his copy up to date with the repository - Edits
spin.c and spin.h - Runs
svn commit to save those changes in the repository- Note: the whole repository gets a new version number
- Ron realizes that he forgot to make one change
- Runs
svn update again, just in case Hermione has also been making changes (she hasn't) - Edits
spin.c again - Runs
svn commit a second time
- Several hours later, Hermione runs
svn update on her working copy- Subversion copies Ron's changes into her directory
![[Basic Use]](img/version/basic_use.png)
Figure 3.1: Basic Use |
- Comment on this slide
How To Do It
- One way to use Subversion is to type commands in a shell
- Guaranteed to work everywhere without anything else being installed
- But there are several good graphical interfaces for Subversion too
RapidSVN runs on Windows, Linux, and Mac- Well, maybe “walks” is a better description—as of Version 0.9, it's not the fastest thing in the world
![[RapidSVN]](img/version/rapidsvn.png)
Figure 3.2: RapidSVN |
TortoiseSVN is a Windows shell extension- Which means that it integrates with the Windows file browser, rather than running separately
- But that also means that it doesn't work on Linux or Mac
![[TortoiseSVN]](img/version/tortoisesvn.png)
Figure 3.3: TortoiseSVN |
- And if you're on a Macintosh, there's
SmartSVN - Comment on this slide
Working Together
- What if two (or more) people want to edit the same file at the same time?
- Option 1: prevent it
- Only allow one person to have a writeable copy of the file at once
- Pessimistic concurrency
- Microsoft Visual SourceSafe
- Option 2: patch up afterwards
![[Merging Conflicts]](img/version/conflict_merge.png)
Figure 3.4: Merging Conflicts |
- Ron and Hermione both have copies of
spin.c version 15int maxRotateSetting(int * available, int length)
{
int i, maxFound;
if (length == 0) {
return 0;
}
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Ron commits his changes
- Creates
spin.c version 16 // Find maximum rotation setting from those available,
// or 0 if none are available.
int maxRotateSetting(int * available, int length)
{
int i, maxFound;
if (length == 0) {
return 0;
}
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Meanwhile, Hermione is editing her copy, and produces this:
// Find maximum rotation setting, or 0.
int maxRotateSetting(int * available, int length)
{
int i, maxFound = 0;
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Tries to submit her changes: conflict!
- Subversion puts both Ron's and Hermione's changes in Hermione's working copy, with markers
<<<<<<< .mine
// Find maximum rotation setting, or 0.
=======
// Find maximum rotation setting from those available,
// or 0 if none are available.
>>>>>>> .r471
int maxRotateSetting(int * available, int length)
{
int i, maxFound = 0;
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Also creates
spin.c.mine, spin.c.15, and spin.c.16 for reference - Hermione must decide what to keep, and what to throw away
- Subversion won't let her commit his changes until all the conflict markers are eliminated
- Once Hermione is done editing, she:
- Runs
svn resolved spin.c to tell Subversion the conflict has been fixed - Runs
svn commit spin.c to create spin.c version 17
- Comment on this slide
What Versions Actually Mean
- The discussion above referred to “version 16 of
spin.c”, but in fact there is no such thing - Instead, there's version 16 (or 17, or 18…) of the repository
- Users are supposed to try to keep the files in the repository in a consistent state
- I.e., don't submit things that are half-done
- Since the next person to do an update would then be in the same half-done state you are
- Subversion therefore updates the version number on the whole repository every time a set of changes is submitted
- Each change set can affect any number of files (including adding or deleting files)
- The phrase “version 229” therefore uniquely identifies an entire set of files
- Unlike CVS and other systems, where version 319 of one file might correspond to version 107 of another, and version 794 of a third
- Comment on this slide
Warning: Binary Files
- Subversion can only mark conflicts this way in text files
- I.e., files that store lines of human-readable characters
- Source code, HTML—basically, anything you can edit with Notepad, Vi, or Emacs
- Images, video clips, Microsoft Word, and many other formats aren't
- When there's a conflict, Subversion saves your copy and the master copy side by side in your working directory
- Up to you to resolve the differences
- Comment on this slide
Rolling Back Changes
- Suppose Ron decides that he doesn't like his recent changes
svn diff will show him which files he has changed, and what those changes are- He hasn't committed anything yet, so he can use
svn revert to re-synchronize with the master copy - If you find yourself doing this repeatedly, you should probably go and do something else for a while…
- Now suppose that Ron decides he doesn't like the changes Hermione just made to
spin.c
- Wants to do the equivalent of “undo” on several files
svn log shows recent history- He decides he wants to revert to version 16 of the repository
svn merge -r 17:16 spin.c means “merge changes, going from version 17 to version 16” (i.e., backwards)
- Can obviously go back to even earlier versions to undo more changes
![[Undoing Changes]](img/version/merge_undo.png)
Figure 3.5: Undoing Changes |
- Comment on this slide
And Finally, Getting Started
- To create a repository:
- Decide where to put it (e.g.,
/rotor/repo) - Go into the containing directory:
cd /rotor svnadmin create repo
- Can then interact with repository in two ways
- Directly through the file system:
file:///rotor/repo
- Use this if you're working on the same machine the repository is on
- Through a web server:
https://your.host.name/rotor/repo
- Use this if the repository is on a remote machine
- Note: requires your system administrator to configure the web server properly
https (instead of http) means “use a secure connection”
- To get a working copy (assuming you're using a web server):
svn checkout https://your.host.name/rotor/repo- Creates a new directory
repo
- Common to give it a more informative name using
svn checkout https://your.host.name/rotor/repo rotorproject
- Important: only use
svn checkout once, to initialize your working copy
- Comment on this slide
Subversion Command Reference
| Name | Purpose | Example |
|---|
svn add | Add files and/or directories to version control. | svn add newfile.c newdir |
svn checkout | Get a fresh working copy of a repository. | svn checkout https://your.host.name/rotor/repo rotorproject |
svn commit | Send changes from working copy to repository (inverse of update). | svn commit -m "Comment on the changes" |
svn delete | Delete files and/or directories from version control. | svn delete oldfile.c |
svn help | Get help (in general, or for a particular command). | svn help update |
svn log | Show history of recent changes. | svn log --verbose *.c |
svn merge | Merge two different versions of a file into one. | svn merge -r 18:16 spin.c |
svn mkdir | Create a new directory and put it under version control. | svn mkdir newmodule |
svn rename | Rename a file or directory, keeping track of history. | svn rename temp.txt release_notes.txt |
svn revert | Undo changes to working copy (i.e., resynchronize with repository). | svn revert spin.h |
svn status | Show the status of files and directories in the working copy. | svn status |
svn update | Bring changes from repository into working copy (inverse of commit). | svn update |
Table 3.1: Common Subversion Commands
- Comment on this slide
How to Read Subversion Output
svn status compares your working copy with the repository, printing one line for each file that's worth talking about$ svn status
M spin.c
MC readme.txt
spin.c has been modifiedreadme.txt has been modified, and has conflicts
svn update prints one line for each file or directory it does something to$ svn update
A newspin.c
U spin.c
C spin.h
newspin.c has been addedspin.c has been updated (i.e., someone else modified it)- There's a conflict in
spin.h
- Which you'll have to resolve before you can commit your changes
- Comment on this slide
Branching and Merging
- Sometimes want to work on several different versions of software at once
- Example: need to do bug fixes on Version 3 while making incompatible changes toward Version 4
- Or want two sets of developers to be able to write and test large changes independently, then put things back together
- All modern version control systems allow you to branch a repository
- Create a “parallel universe” which is initially the same as the original, but which evolves independently
![[Branching and Merging]](img/version/branch_and_merge.png)
Figure 3.6: Branching and Merging |
- Much better than just copying all the source files: the version control system remembers where the branch came from, and can trace its history back
- Can later merge changes from one branch to another
- Example: fix a bug on one branch, merge the changes into other branches that have the same bug
- Again, much better than copying by hand, since the version control system can keep track of where things came from, and where they went
- Warning: many people become over-excited about branching when they first start to use it
- Keeping track of what's going on where can be a considerable management overhead
- On a small project, very rare to need more than two active branches
- Comment on this slide
Exercises
Exercise 3.1:
Follow the instructions given to you by your instructor to
check out a copy of the Subversion repository you'll be using in
this course. Unless otherwise noted, the exercises below
assume that you have done this, and that your working copy is in
a directory called course. You will submit all of your
exercises in this course by checking files into your
repository.
Exercise 3.2:
Create a file course/ex01/bio.txt (where
course is the root of your working copy of your
Subversion repository), and write a short biography of yourself
(100 words or so) of the kind used in academic journals,
conference proceedings, etc. Commit this file to your
repository. Remember to provide a meaningful comment when
committing the file!
Exercise 3.3:
What's the difference between mv and svn
mv? Put the answer in a file called
course/ex01/mv.txt and commit your changes.
Once you have committed your changes, type svn
log in your course directory. If you didn't know
what you'd just done, would you be able to figure it out from
the log messages? If not, why not?
Exercise 3.4:
In this exercise, you'll simulate the actions of two
people editing a single file. To do that, you'll need to check
out a second copy of your repository. One way to do this is to
use a separate computer (e.g., your laptop, your home computer,
or a machine in the lab). Another is to make a temporary
directory, and check out a second copy of your repository there.
Please make sure that the second copy isn't inside the first, or
vice versa—Subversion will become very confused.
Let's call the two working copies Blue and Green. Do the
following:
a) Create Blue/ex01/planets.txt, and add the
following lines:
Mercury
Venus
Earth
Mars
Jupiter
Saturn
Commit the file.
b) Update the Green repository. (You should get a copy of
planets.txt.)
c) Change Blue/ex01/planets.txt so that it reads:
1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
Commit the changes.
d) Edit Green/ex01/planets.txt so that its contents
are as shown below. Do not do svn update
before editing this file, as that will spoil the
exercise.
Mercury 0
Venus 0
Earth 1
Mars 2
Jupiter 16 (and counting)
Saturn 14 (and counting)
e) Now, in Green, do svn update. Subversion
should tell you that there are conflicts in planets.txt.
Resolve the conflicts so that the file contains:
1. Mercury 0
2. Venus 0
3. Earth 1
4. Mars 2
5. Jupiter 16
6. Saturn 14
Commit the changes.
f) Update the Blue repository, and check that
planets.txt now has the same content as it has in the
Green repository.
Exercise 3.5:
Add another line or two to course/ex01/bio.txt and
commit those changes. Then, use svn merge to restore
the original contents of your biography
(course/ex01/bio.txt), and commit the result. When you
are done, bio.txt should look the way it did at the end
of the first part of the previous exercise.) Note: the purpose
of this exercise is to teach you how to go back in time to get
old versions of files—while it would be simpler in this
case just to edit bio.txt, you can't (reliably) do that
when you've made larger changes, to multiple files, over a
longer period of time.
Shell Basics
Introduction
- Most modern tools have a graphical user interface (GUI)
- Because they're easier to use
- But command-line user interfaces (CLUIs) still have their place
- Easier (faster) to build new CLUI tools
- Building a GUI takes time
- Building a good GUI takes a lot of time
- Higher action-to-keystroke ratio
- Once you're over the (steeper) learning curve
- Easier to see and understand what the computer is doing on your behalf
- Which is part of what this course is about
- Most important: it's easier to combine CLUI tools than GUI tools
- Small tools, combined in many ways, can be very powerful
-
[Ray & Ray 2003]
is a good introduction for newcomers
- How to tell if you can skip this lecture
- Do you know what a shell is?
- Do you know the difference between an absolute path and a relative path?
- Do you know what a process is?
- Do you know what a pipe is?
- Do you know what
$PATH is? - Do you know what
rwxr-xr-x means?
- Comment on this slide
The Shell vs. the Operating System
- The most important command-line tool is the command shell (often just called “the shell”)
- Manages a user's interactions with the operating system by:
- Reading commands from the keyboard
- Figuring out what programs the user wants to run
- Running those programs
- Displaying their output on the screen
- Looks (and works) like an interactive terminal circa 1980
![[A Shell in Action]](img/shell01/shell_screenshot.png)
Figure 4.1: A Shell in Action |
- The shell is just one program among many
- Many different ones have been written
sh was the first for Unix- Most others extend its capabilities in various ways
- Which means that it's the lowest common denominator you can always rely on
- We'll use
bash (the Bourne again shell) in this course- Available just about everywhere
- Even on Windows (thanks to
Cygwin)
- In contrast, the operating system is not just another program
- Automatically loaded when the computer boots up
- The only program that can talk directly to the computer's hardware
- I.e., read characters from the keyboard, or send drawing commands to the screen
- Manages files and directories on the disk
- Keeps track of who you are, and what you're allowed to do
- You can run many instances of the shell on a computer at once, but it can only run one operating system at a time
![[Operating System and Shell]](img/shell01/os_shell.png)
Figure 4.2: Operating System and Shell |
- Comment on this slide
The File System
- The file system is the set of files and directories the computer can access
- “Everything that stays put when you turn the computer off and restart it”
- Data is stored in files
- By convention, files have two part names, like
notes.txt or home.html - Most operating systems allow you to associate a filename extension with an application
- E.g.,
.txt is associated with an editor, and .html with a web browser
- But this is all just convention: you can call files (almost) anything you want
- Files are stored in directories (often called folders)
- Directories can contain other directories, too
- Results in the familiar directory tree
![[A Directory Tree]](img/shell01/directory_tree.png)
Figure 4.3: A Directory Tree |
- Everything in a particular directory must have a unique name
- Otherwise, how would you identify it?
- But items in different directories can have the same name
- On Unix, the file system has a unique root directory called
/
- Every other directory is a child of it, or a child of a child, etc.
- On Windows, every drive has its own root directory
- So
C:\home\gvwilson\notes.txt is different from J:\home\gvwilson\notes.txt - When you're using Cygwin, you can also write
C:\home\gvwilson as c:/home/gvwilson - Or as
/cygdrive/c/home/gvwilson
- Some Unix programs give
":" a special meaning, so Cygwin needed a way to write paths without it…
- A path is a description of how to find something in a file system
- An absolute path describes a location from the root directory down
- Equivalent to a street address
- Always starts with
"/" - E.g.,
/home/gvwilson is my home directory, and /courses/swc/lec/shell.swc is this file
- A relative path describes how to find something from some other location
- Equivalent to saying, “Four blocks north, and seven east”
- E.g., from
/courses/swc, the relative path to this file is lec/shell.swc
- Every program (including the shell) has a current working directory
- “Where am I?”
- Relative paths are deciphered relative to this location
- It can change while a program is running
- Finally, two special names:
"." means “the current directory”".." means “the directory immediately above this one- Also called the parent directory
- In
/courses/swc/data, .. is /courses/swc - In
/courses/swc/data/elements, .. is /courses/swc/data ![[Parent Directories]](img/shell01/parent_directory.png)
Figure 4.4: Parent Directories |
- Comment on this slide
A Few Simple Commands
- Easiest way to learn basic Unix commands is to see them in action
- First, I type
pwd (short for "print working directory”) to find out where I am- Unfortunately, most Unix commands have equally cryptic names
- I then type
ls (for “listing”) to see what's in the current directoryls
LICENSE.txt admin data graphics lec pdf scraps util
Makefile cgi-bin etc img mp3 publ src web
- What actually happens when I type
ls is:
- The operating system reads characters from the keyboard
- Passes them to the shell (because it's the currently active window on my desktop)
- The shell breaks the line of text it receives into words
- Looks for a program with the same name as the first word (i.e., the command to run)
- Describe in a moment how the shell knows where to look
- Runs that program
- Reads the program's output and sends it back to the operating system for display
![[Running a Program]](img/shell01/shell_running_program.png)
Figure 4.5: Running a Program |
- I can tell
ls to produce more informative output by giving it some flags
- By convention, flags start with
"-", as in "-c" or "-l" - Show directories with trailing slash
ls -F
LICENSE.txt admin/ data/ exer/ lec/ publ/ soln/ util/
Makefile cgi-bin/ etc/ img/ pdf/ scraps/ src/ web/
- Show all files and directories, including those whose names begin with
.
ls -a
. .. .svn admin data
exer img lec publ soln
src util tmpl README.txt license.txt
todo.txt
- By default,
ls doesn't show anything whose name begins with . - Note the
.svn directory: this is where Subversion keeps administrative information - Do not edit anything in this directory yourself, or Subversion will become very confused
- Comment on this slide
Creating Files and Directories
- Rather than messing with the course files, let's create a temporary directory and play around in there
- Note: no output
- The
-v (“verbose”) flag tells mkdir to print a confirmation message
- Now go into that directory
- Changes the shell's notion of our current working directory
pwd
/home/gvwilson/swc/temp
- No files there yet:
- Use the editor of your choice to create a file called
earth.txt with the following contents:
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
- Notepad (on Windows) runs in a window of its own
- Pico (on Unix) takes over the shell window temporarily
- We'll look at more advanced editing tools for programming in a few lectures
- Easiest way to create a similar file
venus.txt is to copy the one we havels -t
venus.txt earth.txt
- Note: the
-t option tells ls to list newest first
- Check the contents of the file using
cat (short for “concatenate”)
- Just prints the contents of a file to the screen
cat venus.txt
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
- Edit the file so that it looks like this:
Name: Venus
Period: 224.70 days
Inclination: 3.39
Eccentricity: 0.01
- Compare the sizes of the two files using
wc (for “word count”)
wc earth.txt venus.txt
4 9 69 earth.txt
4 9 69 venus.txt
8 18 138 total
- Columns show lines, words, and characters
- Comment on this slide
Wildcards
- Some characters (called wildcards) mean special things to the shell
* matches zero or more characters- So
ls *.f77 lists all the Fortran-77 files in a directory
wc *.txt
4 9 69 earth.txt
4 9 69 venus.txt
8 18 138 total
? matches any single character- So
ls ??.txt lists all the text files with two-letter prefixes - And
ls ??.* lists all the files with two-letter prefixes, and any extension
~ on its own means “my home directory”
- I.e., the one I'm in when I first log in
~harry means “Harry's home directory”
- Note: the shell expands wildcards before running commands
- There's no way for
ls to know whether it was invoked as ls *.txt or rm earth.txt venus.txt
- Comment on this slide
Exercises
Exercise 4.1:
Suppose you are in your home directory, and ls shows
you this:
Makefile biography.txt data
enrolment.txt programs thesis
What argument(s) do you have to give to ls to get it
to put a trailing slash after the names of subdirectories, like
this:
Makefile biography.txt data/
enrolment.txt programs/ thesis/
If you run ls data, it shows:
earth.txt jupiter.txt mars.txt
mercury.txt saturn.txt venus.txt
What command should you run to get the following output:
data/earth.txt data/jupiter.txt data/mars.txt
data/mercury.txt data/saturn.txt data/venus.txt
What if you want this (note that an extra entry is being
displayed):
total 7
drwxr-xr-x 7 someone 0 May 6 08:27 .svn
-rw-r--r-- 1 someone 2396 May 6 08:38 earth.txt
-rw-r--r-- 1 someone 1263 May 6 08:38 jupiter.txt
-rw-r--r-- 1 someone 1015 May 6 08:43 mars.txt
-rw-r--r-- 1 someone 946 May 6 08:41 mercury.txt
-rw-r--r-- 1 someone 1714 May 6 08:40 saturn.txt
-rw-r--r-- 1 someone 881 May 6 08:40 venus.txt
Note: the command will display your user ID, rather than
someone. On some machines, the command will also display a
group ID. Ignore these differences for the purpose of this
question.
Exercise 4.2:
According to the listing of the data directory above, who
can read the file mercury.txt? Who can write it (i.e., change
its contents or delete it)? When was mercury.txt last
changed? What command would you run to allow everyone to edit or
delete the file?
Exercise 4.3:
Suppose you want to remove all files whose names (not including
their extensions) are of length 3, start with the letter a, and
have .txt as extension. What command would you use? For
example, if the directory contains three files a.txt,
abc.txt, and abcd.txt, the command should remove
abc.txt , but not the other two files.
Exercise 4.4:
What does the command cd ~ do? What about cd
~gvwilson?
Exercise 4.5:
What's the difference between the commands cd HOME
and cd $HOME?
Exercise 4.6:
Suppose you want to list the names of all the text files in the
data directory that contain the word "carpentry". What
command or commands could you use?
Exercise 4.7:
Suppose you have written a program called analyze. What
command or commands could you use to display the first ten lines of
its output? What would you use to display lines 50-100? To send
lines 50-100 to a file called tmp.txt?
Exercise 4.8:
The command ls data > tmp.txt writes a listing of
the data directory's contents into tmp.txt. Anything
that was in the file before the command was run is overwritten. What
command could you use to append the listing to tmp.txt
instead?
Exercise 4.9:
What command(s) would you use to find out how many
subdirectories there are in the lectures directory?
Exercise 4.10:
What does rm *.ch? What about rm
*.[ch]?
Exercise 4.11:
What command(s) could you use to find out how many instances of
a program are running on your computer at once? For example, if you
are on Windows, what would you do to find out how many instances of
svchost.exe are running? On Unix, what would you do to
find out how many instances of bash are running?
Exercise 4.12:
What do the commands pushd, popd,
and dirs do? Where do their names come from?
Exercise 4.13:
How would you send the file earth.txt to the
default printer? How would you check it made it (other than
wandering over to the printer and standing there)?
Exercise 4.14:
A colleague asks for your data files. How would you
archive them to send as one file? How could you compress them?
Exercise 4.15:
The instructor wants you to use a hitherto unknown command
for manipulating files. How would you get help on this command?
Exercise 4.16:
You have changed a text file on your home PC, and mailed
it to the university terminal. What steps can you take to see
what changes you may have made, compared with a master copy in
your home directory?
Exercise 4.17:
How would you change your password?
Exercise 4.18:
grep is one of the more useful tools in the
toolbox. It finds lines in files that match a pattern and
prints them out. For example, assume I have files
earth.txt and venus.txt containing lines like
this:
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
If I type grep Period *.txt in that
directory, I get:
earth.txt:Period: 365.26 days
venus.txt:Period: 224.70 days
Search strings can use regular expressions, which will be
discussed in a later lecture.
grep takes many options as well; for example,
grep -c /bin/bash /etc/passwd reports how many lines
in /etc/passwd (the Unix password file) that contain the
string /bin/bash, which in turn tells me how many users
are using bash as their shell.
Suppose all you wanted was a list of the files that
contained lines matching a pattern, rather than the matches
themselves—what flag or flags would you give to
grep? What if you wanted the line numbers of
matching lines?
Exercise 4.19:
diff finds and displays the differences between
two files. It works best if both files are plain text (i.e.,
not images or Excel spreadsheets). By default, it shows the
differences in groups, like this:
3c3,4
< Inclination: 0.00
---
> Inclination: 0.00 degrees
> Satellites: 1
(The rather cryptic header "3c3,4" means that line 3
of the first file must be changed to get lines 3-4 of the
second.)
What flag(s) should you give diff to tell it to
ignore changes that just insert or delete blank lines? What if
you want to ignore changes in case (i.e., treat lowercase and
uppercase letters as the same)?
Exercise 4.20:
Suppose you wanted ls to sort its output by
filename extension, i.e., to list all .cmd files before
all .exe files, and all .exe's before all
.txt files. What command or commands would you
use?
More Shell
Redirecting Input and Output
- A running program is called a process
- By default, every process has three connections to the outside world:
- You can tell the shell to connect standard input and standard output to files instead
command < inputFile reads from inputFile instead of from the keyboard- Don't need to use this very often, because most Unix commands let you specify the input file (or files) as command-line arguments
command > outputFile writes to outputFile instead of to the screen- Only “normal” output goes to the file, not error messages
command < inputFile > outputFile does both![[Redirecting Standard Input and Output]](img/shell02/redirecting_stdio.png)
Figure 5.2: Redirecting Standard Input and Output |
- Example: save number of lines in all text files to
words.len
$ wc -w *.txt > words.len
- Nothing appears, because output is being sent to the file
words.len $ ls -t
words.len venus.txt earth.txt
$ cat words.len
4 9 69 earth.txt
4 9 69 venus.txt
3 12 62 words.len
11 30 200 total
- Try typing
cat > junk.txt
- No input file specified, so
cat reads from the keyboard - Output sent to a file
- Voila: the world's dumbest text editor
- When you're done, use
rm junk.txt to get rid of the file- Don't type
rm * unless you're really, really sure that's what you want to do…
- Comment on this slide
Pipes
- Suppose you want to use the output of one program as the input of another
- E.g., use
wc -w *.* to count the words in some files, then sort -n to sort numerically
- Option 1: send output of first command to a temporary file, then read from that file
wc *.txt > temp
sort -n < temp
- Option 2: use a pipe to connect the two programs
- Written as
"|" - Tells the operating system to send what the first program writes to its stdout to the second program's stdin
wc -w *.* | sort -n
9 earth.txt
9 venus.txt
12 words.len
30 total
![[Pipes]](img/shell02/pipes.png)
Figure 5.3: Pipes |
- More convenient (and much less error prone) than temporary files
- Can chain any number of commands together
- And combine with input and output redirection
wc *.txt | sort -n | head -5 > shortest.files
- Any program that reads from standard input and writes to standard output can use redirection and pipes
- Programs that do this are often called filters
- If you make your programs work like filters, you can easily combine them with others
- A combinatorial explosion of goodness
- Comment on this slide
Environment Variables
- Like any other program, the shell has variables
- Since they define a user's environment, they are usually called environment variables
- Type
set at the command prompt to get a listing:
$ set
ANT_HOME=C:/apache-ant-1.6.2
BASH=/usr/bin/bash
COLUMNS=80
COMPUTERNAME=ISHAD
HISTFILESIZE=500
- Get a particular variable's value by putting a
"$" in front of its name- E.g., the shell replaces
"$HOME" with the current user's home directory - Often use the
echo command to print this out$ echo $HOME
/home/gvwilson
- Question: why must you type
echo $HOME, and not just $HOME?
- To set or reset a variable's value temporarily, use this:
- Only affects the current shell (and programs run from it)
- To set a variable's value automatically when you log in, set it in
~/.bashrc
- Remember,
"~" is a shortcut meaning “your home directory”
- For me, right now,
~/.bashrc is /home/gvwilson/.bashrc
- Important environment variables
| Name | Typical Value | Notes |
|---|
HOME | /home/gvwilson | The current user's home directory |
HOMEDRIVE | C: | The current user's home drive (Windows only) |
HOSTNAME | "ishad" | This computer's name |
HOSTTYPE | "i686" | What kind of computer this is |
OS | "Windows_NT" | What operating system is running |
PATH | "/home/gvwilson/bin:/usr/local/bin:/usr/bin:/bin:/Python24/" | Where to look for programs |
PWD | /home/gvwilson/swc/lec | Present working directory (sometimes CWD, for current working directory) |
SHELL | /bin/bash | What shell is being run |
TEMP | /tmp | Where to store temporary files |
USER | "gvwilson" | The current user's ID |
Table 5.1: Important Environment Variables
- Comment on this slide
How the Shell Finds Programs
- The most important of these variables is
PATH
- The search path that tells the shell where to look for programs
- When you type a command like
tabulate, the shell:
- Splits
$PATH on colons to get a list of directories - Looks for the program in each directory, in left-to-right order
- Runs the first one that it finds
- Example
PATH is /home/gvwilson/bin:/usr/local/bin:/usr/bin:/bin:/Python24- Both
/usr/local/bin/tabulate and /home/gvwilson/bin/tabulate exist /home/gvwilson/bin/tabulate will be run when you type tabulate at the command prompt- Can run the other one by specifying the path, instead of just the command name
- Warning: it is common to include
. in your path- This allows you to run a program in the current directory just by typing
whatever, instead of ./whatever - But it also means you can never be quite sure what program a command will invoke
- Though you can use the command
which program_name, which will tell you
- Common entries in
PATH include:
/bin, /usr/bin: core tools like ls
- Note: the word “bin” comes from “binary”, which is geekspeak for “a compiled program”
/usr/local/bin: optional (but common) tools, like the gcc$HOME/bin: tools you have built for yourself- Remember,
$HOME means “the user's home directory” - So this is equivalent to
~/bin
- Cygwin does things a little differently
- Uses the notation
/cygdrive/c/somewhere instead of Windows' c:/somewhere
- The colon in
c:/somewhere would clash with the colons in the PATH variable
- By default, Cygwin treats
c:/cygwin as the root of its file system- So
/home/aturing is a synonym for c:/cygwin/home/aturing - Yes, it can be confusing, but remember: we're trying to run one operating system's tools on top of another
- Comment on this slide
Basic Tools
man | Documentation for commands. |
cat | Concatenate and display text files. |
cd | Change working directory. |
chmod | Change file and directory permissions. |
clear | Clear the screen. |
cp | Copy files and directories. |
date | Display the current date and time. |
diff | Show differences between two text files. |
echo | Print arguments. |
env | Show environment variables. |
head | Display the first few lines of a file. |
ls | List files and directories. |
mkdir | Make directories. |
more | Page through a text file. |
mv | Move (rename) files and directories. |
od | Display the bytes in a file. |
passwd | Change your password. |
pwd | Print current working directory. |
rm | Remove files. |
rmdir | Remove directories. |
sort | Sort lines. |
tail | Display the last few lines of a file. |
uniq | Remove duplicate lines. |
wc | Count lines, words, and characters in a file. |
which | locate a command |
Table 5.2: Basic Command-Line Tools
- Comment on this slide
Ownership and Permission: Unix
- On Unix, every user belongs to one or more groups
- The
groups command will show you which ones you are in
- Every file is owned by a particular user and a particular group
- Owner can assign different read, write, and execute permissions to user, group, and others
- Read: can look at contents, but not modify them
- Write: can modify contents
- Execute: can run the file (e.g., it's a program)
ls -l will show all of this information- (Along with the file's size and a few other things)
- Permissions displayed as three
rwx triples - “Missing” permissions shown by
"-" - Example:
rw-rw-r-- means “user and group can read and write; everyone else can read; no one can execute”
- Change permissions using
chmod
- Example:
chmod u+x something.exe gives the user execute permission to something.exe - Example:
chmod o-r notes.txt takes away the world's read permission for notes.txt
- Permissions mean something a little different for directories
- Execute permission means you can “go into” a directory, but does not mean you can read its contents
- So if a directory called
tools has permission rwx--x--x (i.e., owner can do anything, but everyone else only has execute permission), then:
- If someone other than the owner does
ls tools, permission is denied - But if there's a useful program called
tools/findanswers, other users can still run it
- Comment on this slide
Ownership and Permission: Windows
- Of course, it all works differently on Windows
- Not better or worse, just differently
- Windows XP uses access control lists (ACLs)
- Instead of describing users as “file owner, group member, or something else”, ACLs let you specify exactly what any particular user, or set of users, can do to a file, directory, device, etc.
- Older versions of Windows (such as Windows 95 and Windows 2000) are fundamentally insecure, and shouldn't be used
- Cygwin does its best to make the Windows model look like Unix's
- If you trip over the differences, please consult a system administrator
- Comment on this slide
More Advanced Tools
du | Print the disk space used by files and directories. |
find | Find files that match a pattern. |
grep | Print lines matching a pattern. |
gunzip | Uncompress a file. |
gzip | Compress a file. |
lpr | Send a file to a printer. |
lprm | Remove a print job from a printer's queue. |
lpq | Check the status of a printer's queue. |
ps | Display running processes. |
tar | Archive files. |
which | Find the path to a program. |
who | See who is logged in. |
xargs | Execute a command for each line of input. |
Table 5.3: Advanced Command-Line Tools
- Comment on this slide
Exercises
Exercise 5.1:
You're worried your data files can be read by your
nemesis, Dr. Evil. How would you check whether or not he can,
and if necessary change permissions so only you can read or
write the files?
Basic Scripting
Why Python?
- Two factors determine time to solution:
- How long it takes to write a program (human time)
- How long it takes that program to run (machine time)
- Different languages make different tradeoffs between programming speed and execution speed
- Programmers write the same number of lines of code per day no matter what language they're using, so high-level languages are good
- But the more abstract the language, the more work it is for the computer to figure out what you want it to do
![[Nimble vs. Sturdy Languages]](img/py01/nimble_vs_sturdy.png)
Figure 6.1: Nimble vs. Sturdy Languages |
Python is:
- Like the shell, only better
- Freely available for many platforms
- Widely used
- Well documented
- (Much) easier to read than Perl
- Material that took three days to teach in Perl took only two to teach in Python
- Follow-up surveys showed significantly higher retention rates
- (Much) slower than C/C++ or Fortran
- 10-100 times slower than compiled, optimized code
- But it's relatively easy to call C/C++ and Fortran libraries from Python
- Doesn't have all of MATLAB's numerical tools
- But its
Numeric package isn't bad - And it has a lot of things MATLAB doesn't
- This course isn't really about Python
- It's about solving simple problems with the least effort
- But we have to write the examples in something
- So we might as well choose something simple and useful
- For more information:
- Comment on this slide
Running Python Interactively
- Running a program in a sturdy language is a two-step process:
- Compiler translates source code into something that can run
- That “something” then runs
- Directly on the hardware (C, C++, Fortran)
- On a virtual machine (Java)
- Some combination of the two (C#)
- Nimble languages typically combine the compiler and the virtual machine
- A single program reads the user's code, translates it into something executable, and executes it right away
![[Sturdy vs. Nimble Execution]](img/py01/sturdy_vs_nimble_execution.png)
Figure 6.2: Sturdy vs. Nimble Execution |
- This means that most nimble languages can run interactively, like a shell
$ python
>>> 3 + 5
8
>>> x = 5 * 3 ** 2 # assign 5 times 3 squared to x
>>> print x
45
>>> print 'some' + 'thing' # concatenate strings
something
- Comment on this slide
Running Saved Programs
- Obviously don't have to retype program every time you want to run it
- Option 1: save program in a file with a
.py extension, and type python filename.py
- Python reads and executes the commands in the file exactly as if they'd been typed in interactively
- Option 2 (Unix only): make the following the first line of the
.py file- This tells Unix to look up a program called
python, and run it with the rest of the file as its input
- Option 3 (Windows only): associate
.py files with Python- Double-clicking on anything ending in
.py will then run it - Happens automatically when you run the Python Windows installer
- Example
- Using an editor, put the following in
powers.py print 2, 2**2, 2**3, 2**4
print 3, 3**2, 3**3, 3**4
print 4, 4**2, 4**3, 4**5
- Run using
python powers.py - Should see the following
2 4 8 16
3 9 27 81
4 16 64 1024
- Comment on this slide
Variables
- Variables are names for values
- No declarations: variables are created when something is assigned to them
![[Variables Refer to Values]](img/py01/vars_refer_to_values.png)
Figure 6.3: Variables Refer to Values |
- No types: a variable is just a name, and can refer to different types of values at different times
- Although your code will be easier to understand if you don't abuse this
- Must give a variable a value before using it
- Unlike some languages, Python doesn't try to guess a “sensible” value
print something
Traceback (most recent call last):
File "undefined.py", line 1, in ?
print something
NameError: name 'something' is not defined
- While variables don't have types, values do
- If you try to operate on incompatible values, Python will complain
x = 'two' # 'two' is a string
y = 2 # 2 is an integer
print x * y # multiplying a string concatenates it repeatedly
print x + y # but you can't add an integer and a string
twotwo
Traceback (most recent call last):
File "add_int_str.py", line 4, in ?
print x + y # but you can't add an integer and a string
TypeError: cannot concatenate 'str' and 'int' objects
- Comment on this slide
Printing and Quoting
- The
print statement prints zero or more values- Separated by spaces
- Automatically puts a newline at the end
- So
print on its own just prints a blank line
- Use either single or double quotes to create strings
- Must be consistent: if a string starts with one kind of quote, it must end with that kind of quote
- But different strings in the same program can use different kinds of quotes
- This was not necessarily a good design decision…
print 'a', "b", '"c"', "'d'"
a b "c" 'd'
- The built-in function
str converts things to stringsprint 'carbon-' + str(14)
carbon-14
- Use similar functions called
int, float, etc. to convert values to other types
- Use escape sequences to put special characters in strings
- Single quotes in a single-quoted string:
'This isn\'t unusual.' - Double quotes in a double-quoted string:
"She said, \"You can quote me on that!\"" - Tab and newline characters:
'\tIndented line\n'
- Comment on this slide
Numbers and Arithmetic
- The usual numeric types
14 is an integer (32 bits long on most machines)14.0 is floating point (double precision, i.e., 64 bits long)
- Two unusual numeric types
1+4j is a complex number (two 64-bit values)123456789L is a long integer- Arbitrary length: uses as much memory as it needs to
- Operations are several times slower
- Python automatically switches to long-integer mode when it needs to
- Python borrows C's numeric operators
| Name | Operator | Use | Value | Notes |
|---|
| Addition | + | 35 + 22 | 57 | |
| | 'Py' + 'thon' | 'Python' | |
| Subtraction | - | 35 - 22 | 13 | |
| Multiplication | * | 3 * 2 | 6 | |
| | 'Py' * 2 | 'PyPy' | 2 * 'Py' is illegal |
| Division | / | 3.0 / 2 | 1.5 | |
| | 3 / 2 | 1 | Integer division rounds down: -3 / 2 is -2, not -1 |
| Exponentiation | ** | 2 ** 0.5 | 1.4142135623730951 | |
| Remainder | % | 13 % 5 | 3 | |
Table 6.1: Numeric Operators in Python
- Python also supports C's in-place operators
x += 3 does the same thing as x = x + 35 += 3 is an error, since you can't assign a new value to 5…
- Comment on this slide
Booleans
True and False are true and false (d'oh)- Empty string, 0, and
None are equivalent to false- Just as 3 is equivalent to 3.0
- (Almost) everything else is true
- Combine Booleans using
and, or, not
and and or are short-circuit operators- Evaluate expressions left to right, and stop as soon as they know the answer
- Result is the last thing evaluated, rather than
True or False | Expression | Result | Notes |
|---|
True or False | True | |
True and False | False | |
'a' or 'b' | 'a' | or is true if either side is true, so it stops after evaluating 'a' |
'a' and 'b' | 'b' | and is only true if both sides are true, so it doesn't stop until it has evaluated 'b' |
0 or 'b' | 'b' | 0 is false, but 'b' is true |
0 and 'b' | 0 | Since 0 is false, Python can stop evaluating there |
0 and (1/0) | 0 | 1/0 would be an error, but Python never gets that far |
(x and 'set') or 'not set' | It depends | If x is true, this expression's value is 'set'; if x is false, it is 'not set' |
Table 6.2: Boolean Operators in Python
- Comment on this slide
Comparisons
- Python borrows C's comparison operators, too
- But allows you to chain comparisons together, just as in mathematics
| Expression | Value |
|---|
3 < 5 | True |
3.0 < 5 | True |
3 != 5 | True |
3 == 5 | False |
3 < 5 <= 7 | True |
3 < 5 >= 2 | True (but please don't write this—it's hard to read) |
3+2j < 5 | Error: can only use == and != on complex numbers |
Table 6.3: Comparison Operators in Python
- Note the difference between assignment and testing for equality
- Use a single equals sign
= for assignment - Use a double equals sign
== to test if two things have equal values
- String comparison may not do what you expect
- Characters are encoded as numbers: digits come before uppercase letters, all of which come before lowercase letters
- Punctuation is mixed in between, just to make matters difficult
- Strings are compared character by character from first to last until:
- One character is less than another
- One string runs out of characters
| Expression | Value |
|---|
'abc' < 'def' | True |
'abc' < 'Abc' | False |
'ab' < 'abc' | True |
'0' < '9' | True |
'100' < '2' | True |
Table 6.4: Python String Comparisons
- Comment on this slide
Conditionals
- Python uses
if, elif (not else if), and else - Use a colon and indentation to show nesting
a = 3
if a < 0:
print 'less'
elif a == 0:
print 'equal'
else:
print 'greater'
- Why indentation?
- Based on studies from the 1980s, it's what everyone looks at anyway
- Just count the number of warnings in C/Java books about misleading indentation
- Doesn't matter how much you use, but:
- Everything in the block must be indented the same amount
- Tabs are expanded so that they're equivalent to (up to) eight characters
- So don't ever indent with tabs, since your editor may interpret them differently
- Many editors understand Python indentation, and will help you get it right
- If you're fond of legacy character-oriented editors,
Emacs and Vim both work well - If you want a free integrated development environment (IDE), try
SPE or PyDev - If you'd prefer something with commercial support,
Komodo and WingIDE are my favorites
- Comment on this slide
While Loops, Break, and Continue
- Do something repeatedly as long as some condition is true
- Again, use colon and indentation to show nesting
a = 3
while a > 0:
print a
a -= 1
3
2
1
- Do the “something” zero times if the condition is false the first time it is tested
print 'before'
a = -1
while a > 0:
print a
a -=1
print 'after'
before
after
- If the condition is always true, the loop never ends
a = 3
while a > 0:
print a
# oops --- forgot to subtract one from a
3
3
3
- Can break out of the middle of a loop using
break
a = 30
while True:
print a
a -= 1
if (a % 5) == 0: # if a is divisible by 5
break
30
29
28
27
26
- Don't abuse this: put the test at the top of the loop unless there's a really good reason not to
- Can skip to the next iteration using
continue
a = 5
while a > 0:
print 'top of loop', a
a -= 1
if (a % 2) == 0:
continue
print '...bottom of loop', a
top of loop 5
top of loop 4
...bottom of loop 3
top of loop 3
top of loop 2
...bottom of loop 1
top of loop 1
- Again, don't abuse this by writing loops that are hard for other people to figure out
- If only because “other people” includes you three months from now
- Comment on this slide
Strings, Lists, and Files
Where We Just Were
- Python is a nimble language with:
- Numbers, strings, and Booleans
- Conditionals (
if, elif, and else) while loops (with break and continue)
- This lecture describes:
- Lists, for storing collections of values
- Functions, which reduce redundancy, and make programs easier to read
- Reading and writing files
- Comment on this slide
But First, Strings
- A string is an immutable sequence of characters
- Immutable means that it cannot be modified once it has been created
- I.e., you cannot change individual characters in place
str = 'abc'
print 'str is', str
str[0] = 'x'
print 'str is now', str
str is abc
Traceback (most recent call last):
File "immutable_err.py", line 3, in ?
str[0] = 'x'
TypeError: object does not support item assignment
- Though you can of course assign a new string value to a variable
str = 'abc'
print 'str is', str
str = 'xyz'
print 'str is now', str
str is abc
str is now xyz
- Sequence means that it can be indexed
- Indices start at 0 (as they do in C)…
- …so
text[0] is the first character of the string text - The built-in function
len returns the length of a string… - …so the index of the last character of
text is len(text)-1
- Note: there is no separate data type for characters
- A character is simply a string of length 1
element = "boron"
i = 0
while i < len(element):
print element[i]
i += 1
b
o
r
o
n
- Comment on this slide
Slicing, Bounds, and Negative Indices
text[start:end] takes a slice out of text
- Creates a new string containing the characters of
text from start up to (but not including) end val = "helium"
print val[1:3], val[:2], val[4:]
el he um
- Sometimes helps to think of indices as being between elements
![[Visualizing Indices]](img/py02/index_between.png)
Figure 7.1: Visualizing Indices |
- A few logical consequences:
text[1:2] is either the second character in text, or the empty string (if text doesn't have a second character)- So
text[1:1] is always the empty string- From index 1, up to but not including index 1
- And
text[2:1] is always the empty string
- Negative indices count backward from the end of the string
x[-1] is the last character- A lot easier to read than
x[len(x)-1]
x[-2] is the second-to-last character![[Visualizing Negative Indices]](img/py02/index_between_negative.png)
Figure 7.2: Visualizing Negative Indices |
val = "carbon"
print val[-2], val[-4], val[-6]
o r c
- Bounds checking rules:
- Python always does an out-of-bounds check when you index a single item
- But it truncates out-of-range indices when you take a slice
val = "helium"
print val[1:22]
x = val[22]
elium
Traceback (most recent call last):
File "bounds.py", line 3, in ?
x = val[22]
IndexError: string index out of range
- Comment on this slide
String Methods
- A method is a function that's tied to a particular object
- Invented to help programmers organize their code
- To call method
M of object X, type X.M() - We'll see how to create objects and methods of our own later
- Almost everything in Python has methods
- Numbers are the only important exception
- String methods
| Method | Purpose | Example | Result |
|---|
capitalize | Capitalize first letter of string | "text".capitalize() | "Text" |
lower | Convert all letters to lowercase. | "aBcD".lower() | "abcd" |
upper | Convert all letters to uppercase. | "aBcD".upper() | "ABCD" |
strip | Remove leading and trailing whitespace (blanks, tabs, newlines, etc.) | " a b ".strip() | "a b" |
lstrip | Remove whitespace at left (leading) edge of string. | " a b ".lstrip() | "a b " |
rstrip | Remove whitespace at right (trailing) edge of string. | " a b ".rstrip() | " a b" |
count | Count how many times one string appears in another. | "abracadabra".count("ra") | 2 |
find | Return the index of the first occurrence of one string in another, or -1. | "abracadabra".find("ra") | 2 |
| | "abracadabra".find("xyz") | -1 |
replace | Replace occurrences of one string with another. | "abracadabra".replace("ra", "-") | "ab-cadab-" |
Table 7.1: String Methods
- Notes
- These methods don't have to be called on constant strings
text = "Hogwarts"
print text.upper()
print text.replace("rts", "sh")
HOGWARTS
Hogwash
- These methods create new strings: they do not change the strings they're called on
text = "Hogwarts"
print "method result is", text.replace("rts", "sh")
print "variable text is", text
method result is Hogwash
variable text is Hogwarts
- You can chain method calls together, and mix with indexing, concatenation, etc.
- If the result of one method call is a string, you can immediately call a method on that result
text = "Hermione"
print ':' + text.upper()[4:7].center(10) + ':'
: ION :
- Use this in moderation: long chains of method calls are hard to read and debug
- You can use
is to check whether one string appears in anotherprint "atta" in "gatcattcgat"
print "tcga" in "gatcattcgat"
False
True
- Comment on this slide
Lists
- A list is a mutable sequence of objects
- Mutable means that, unlike a string, it can be changed in place
- Of objects means that lists can hold anything and everything
- Think of it as a one-dimensional array, or vector, that automatically resizes itself as needed
- Write lists using square brackets:
[], [3], [5, "b"]
- The empty list is considered false
- Indexing and slicing work almost the same way for strings and lists:
- Indexing or slicing a string always returns a string
- Indexing a list returns a single item
- Slicing a list returns a new list
x = ["a", 2, "bcd", 4]
print x[0]
print x[-1]
print x[1:-1]
a
4
[2, 'bcd']
- As before, Python always checks bounds on element access, but truncates bounds on slices
- Assign a new value to a list element using
x[i] = v
gases = ["helium", "argon", "neon"]
print gases
i = 0
while i < len(gases):
gases[i] = "changed"
print gases
i += 1
['helium', 'argon', 'neon']
['changed', 'argon', 'neon']
['changed', 'changed', 'neon']
['changed', 'changed', 'changed']
- Adding two (or more) lists concatenates their elements
- Creates a new list
left = ['L', 'E', 'F', 'T']
right = ['W', 'R', 'I', 'T', 'E']
both = left + right
print both
['L', 'E', 'F', 'T', 'W', 'R', 'I', 'T', 'E']
- Can't concatenate a string and a list
- But
list(text) creates a list whose elements are the characters of the string text
print list("Dumbledore")
['D', 'u', 'm', 'b', 'l', 'e', 'd', 'o', 'r', 'e']
del deletes something from the list, and shortens the list- The “something” can be a slice, in which case several items are deleted
values = ['a', 'b', 'c', 'd', 'e']
print 'original', values
del values[2]
print 'after deleting item 2', values
del values[-2:]
print 'after deleting last two remaining items', values
original ['a', 'b', 'c', 'd', 'e']
after deleting item 2 ['a', 'b', 'd', 'e']
after deleting last two remaining items ['a', 'b']
- Note:
del modifies the list, and doesn't return anything
- Comment on this slide
List Methods
- Like strings, lists are objects, and have methods
- In the examples below, assume
x is initially ['c', 'a', 'b', 'c'] | Method | Purpose | Example | Result |
|---|
append | Add to the end of the list. | x.append(99) | ['c', 'a', 'b', 'c', 99] |
count | Count how many times something appears in the list. | x.count('c') | 2 |
index | Find the first occurrence of something in the list. | x.index('a') | 1 |
insert | Insert something into the list. | x.insert(2, 'diamond') | ['c', 'a', 'diamond', 'b', 'c'] |
remove | Remove the first occurrence of something from the list. | x.remove('c') | ['a', 'b', 'c'] |
reverse | Reverse the list in place. | x.reverse() | ['c', 'b', 'a', 'c'] |
sort | Sort the list in place. | x.sort() | ['a', 'b', 'c', 'c'] |
Table 7.2: List Methods
- Notes
- It's an error if
list.index can't find the item it's searching for- The next lecture describes how to handle errors
list.reverse and list.sort change the list, and return None
x = x.reverse() is a common error- It reverses
x, but then sets x to None, so all data is lost
- Comment on this slide
For Loops and Ranges
- Python's
for statement automatically loops over the content of a collectionfor x in text sets x to each character of text in turnfor x in someList sets x to each element of someList in turnfor c in 'lead':
print '+' + c + '+',
print
for x in ['helium', 'argon', 'neon']:
print x.capitalize()
+l+ +e+ +a+ +d+
Helium
Argon
Neon
- The trailing comma in the
print statement suppresses the automatic newline
- The built-in function
range creates the list [start, start+1, ..., end-1]
- Note:
end-1, just as x[start:end] is the elements of x up to, but not including, end range(end) is the same as range(0, end)range(start, end, step) goes in increments of step- May generate an empty list
print range(2, 5)
print range(3)
print range(2, 10, 3)
print range(3, 1)
[2, 3, 4]
[0, 1, 2]
[2, 5, 8]
[]
- So:
- To loop from 0 to N-1, use
for i in range(N) - To loop over the indices of a list or string, use
for i in range(len(sequence)) chars = "abc"
for i in range(len(chars)):
print i, chars[i]
0 a
1 b
2 c
- Comment on this slide
Membership
x in c is True if the value x is in the collection c
- Equivalent to a substring test on strings
'x' in 'Harry' is False'ar' in 'Harry' is True
- Works element-by-element on lists
3 in [1, 2, 3, 4] is True[2, 3] in [1, 2, 3, 4] is False
vowels = 'aeiou'
for v in vowels:
if v in 'uranium':
print v
a
i
u
- Comment on this slide
Nesting Lists
- Since lists can contain anything, you can build lists of lists of lists of…
- Example: use a list of two values to represent a point, and a list of points to represent a line segment
![[Line Segment]](img/py02/line_segment.png)
Figure 7.3: Line Segment |
- Write indices left to right to select elements from the outside in
x = [[13, 17, 19], [23, 29]]
print x[1]
print x[0][1:3]
[23, 29]
[17, 19]
- The nested lists are objects in their own right
- The outer list stores a reference to the inner list
- But the inner list does not know that it's being referred to
- Object in Python (almost) never know who's pointing at them
- When you index the outer list, you get an alias for the inner list
- Another name for the same data
- Changes made through either reference update the same data
x = [['a', 'b'], ['c', 'd', 'e']]
y = x[0]
print 'before: nested list is', x
print '...and sublist is', y
y[0] = 123
print 'after: nested list is', x
print '...and sublist is', y
before: nested list is [['a', 'b'], ['c', 'd', 'e']]
...and sublist is ['a', 'b']
after: nested list is [[123, 'b'], ['c', 'd', 'e']]
...and sublist is [123, 'b']
![[Aliasing In Action]](img/py02/aliasing.png)
Figure 7.4: Aliasing In Action |
- I lied to you earlier when I said that indexing and slicing does the same thing on lists that it does on strings
- The technical term is “pedagogically motivated misdirection”
- Indexing returns whatever the list itself contains
- In the case of a nested list, this is a reference to the sublist
- Slicing returns a new list containing the selected elements of the original list
- So changes to a slice do not affect the original list
- Can't tell the difference with strings, since there's no way to update one in place anyway
x = [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
y = x[0:2]
print 'before: nested list is', x
print '...and slice is', y
y[0] = 123
print 'after: nested list is', x
print '...and modified slice is', y
before: nested list is [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
...and slice is [['a', 'b'], ['c', 'd', 'e']]
after: nested list is [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
...and modified slice is [123, ['c', 'd', 'e']]
![[Slicing Copies Data]](img/py02/slice_copy.png)
Figure 7.5: Slicing Copies Data |
- Yes, it can be confusing…
- …but anything else would be unacceptably slow…
- …so draw diagrams when you need to…
- …and don't write overly-complicated code in the first place
- Comment on this slide
Tuples
- Python actually has a second type of list, called a tuple
- Just like a normal list, but immutable (i.e., can't be changed after creation)
- Written using parentheses instead of square brackets:
(1, 2, 3) instead of [1, 2, 3] - Empty tuple is
() - Tuple with one element must be written with a comma, as in
(55,)
- Because
(55) has to be just the integer 55, or the mathematicians will get upset
- Why bother?
- Performance: if you know the length of the sequence isn't going to change, you can do some things more quickly
- There are places where Python needs immutable objects that contain other values
- We'll meet them in the next lecture
- I still think it's a wart…
- Tuples allow Python to provide multi-valued assignment
a, b = 2, 3 assigns 2 to a, and 3 to b
- Equivalent to
a, b = (2, 3)
a, b, c = [1, 2, 3] does what you'd expecta, b = [1, 2, 3, 4] is an error- Number of targets on the left must match the number of values on the right
a, b = b, a swaps the values in a and b
- Use multi-valued assignment in
for loops to unpack structures on the flycorners = [[1, 3], [2, 7], [4, 0], [4, 9]]
sumX, sumY = 0, 0
for cX, cY in corners:
sumX += cX
sumY += cY
sumX = float(sumX) / len(corners)
sumY = float(sumY) / len(corners)
print "center is", sumX, sumY
center is 2.75 4.75
- Comment on this slide
Files
- Use the built-in function
open() to open a file- First argument is path
- Second is
'r' (for read) or 'w' for write - Result is a file object with methods for input and output
infile = open('file.txt', 'r')
outfile = open('copy.txt', 'w')
line = infile.readline()
while line:
outfile.write(line)
line = infile.readline()
infile.close()
outfile.close()
- Here's what just happened:
- First statement opens
file.txt for reading, and assigns the file object to input - Second statement opens
copy.txt for writing, and assigns the file object to output - Program then tries to read a line from
input
- If the file is empty,
line is assigned the value None - Like
Null in C, or null in Java, it means “nothing here”, and is considered false
- As long as there are lines in the input file, the program:
- Writes the most recent line to
output - Reads the next line
- When the input file is exhausted, the program closes
input and output
- Python will close the files automatically when the program exits…
- …but it's good practice to tidy up your toys when your done playing with them
- Comment on this slide
Other Ways to Do It
- Could also copy files like this:
infile = open('file.txt', 'r')
outfile = open('copy.txt', 'w')
for line in infile:
outfile.write(line)
infile.close()
outfile.close()
- Or this:
infile = open('file.txt', 'r')
contents = infile.readlines()
infile.close()
outfile = open('copy.txt', 'w')
outfile.writelines(contents)
outfile.close()
- Moral of the story: get to know the standard library of whatever language you're using
- A week of hard work can sometimes save you an hour of reading, too…
- Comment on this slide
Exercises
Exercise 7.1:
What does "aaaaa".count("aaa") return? Why?
Exercise 7.2:
What does the built-in function enumerate do? Use it to
write a function called findOver that takes a list of numbers
called values, and a number called threshold, as
arguments, and returns a list of the locations where items in
values are greater than threshold. For example,
findOver([1.1, 3.8, -1.6, 7.4], 2.0) should return [1, 3],
since the values in the input list at locations 1 and 3 are
greater than the threshold 2.0.
Exercise 7.3:
What do each of the following five code fragments do? Why?
x = ['a', 'b', 'c', 'd']
x[0:2] = []
|
x = ['a', 'b', 'c', 'd']
x[0:2] = ['q']
|
x = ['a', 'b', 'c', 'd']
x[0:2] = 'q'
|
x = ['a', 'b', 'c', 'd']
x[0:2] = 99
|
x = ['a', 'b', 'c', 'd']
x[0:2] = [99]
|
Exercise 7.4:
What does 'a'.join(['b', 'c', 'd']) return? If you have
a list of strings, how can you concatenate them in a single statement?
Why do you think join is written this way, rather than as
['b', 'c', 'd'].join('a')?
Functions, Libraries, and the File System
Where We Just Were
- Python has:
- The usual primitive data types (numbers, strings, Booleans)
- The usual ways to combine statements (loops and conditionals)
- Lists for storing collections of data
- This lecture describes:
- Functions, for reusing code, and making it easier to understand
- Modules, for building libraries
- Four commonly-used libraries
- Comment on this slide
Defining Functions
- Define a new function using
def
- Argument names follow in parentheses
# Define function.
def ave(x, y):
return (x + y) / 2.0
# Use function.
print ave(20, 30)
25.0
- Exit at any time with
return
def sign(x):
if x < 0:
return -1
if x == 0:
return 0
return 1
for i in [-17.0, 33.3, 0.0]:
print i, sign(i)
-17.0 -1
33.3 1
0.0 0
- Functions with lots of
return statements tend to be hard to understand
- Every function returns something
return on its own is the same as return None- Functions without
return statements also return None
list.sort and list.reverse are examples
def double_elt(x):
for i in range(len(x)):
x[i] = x[i] * 2
values = [3, 'xyz', [9]]
print "values before call:", values
result = double_elt(values)
print "result of call:", result
print "values after call:", values
values before call: [3, 'xyz', [9]]
result of call: None
values after call: [6, 'xyzxyz', [9, 9]]
- Comment on this slide
Scope
- Variables created in function are local to it
- Fresh copies created for each call
- Values discarded when the function returns
- As always, variables must be defined before they can be used
- Python manages variables using a call stack
# Global variable called 'x'.
x = 123
# Function that defines a local variable called 'x'.
def f(arg):
x = arg
print "in call, x is", x
# Call the function to prove that it uses its local 'x'.
print "before call, global x is", x
f(999)
print "after call, global x is", x
before call, global x is 123
in call, x is 999
after call, global x is 123
![[Call Stack (a)]](img/py03/call_stack_a.png)
Figure 8.1: Call Stack (a) |
![[Call Stack (b)]](img/py03/call_stack_b.png)
Figure 8.2: Call Stack (b) |
![[Call Stack (c)]](img/py03/call_stack_c.png)
Figure 8.3: Call Stack (c) |
- Each time a function is called, Python creates new variables, and puts them on top of the stack
- When a variable is referenced, Python looks at the top stack frame, then at global variables
- Unlike some languages, does not search every frame in the stack
- Comment on this slide
Parameter Passing Rules
- Python copies variables' values when passing them to functions
- Since Booleans, numbers, and strings can't be updated, a function can't affect its caller's values
- But if you pass a list to a function, the function will operate on the original list, not a copy
def mutate(x, y):
x = 0
y[0] = "modified"
single = 1
triple = [1, 2, 3]
print "before call, single is", single, "and triple is", triple
mutate(single, triple)
print "after call, single is", single, "and triple is", triple
before call, single is 1 and triple is [1, 2, 3]
after call, single is 1 and triple is ['modified', 2, 3]
![[Argument Passing (a)]](img/py03/arg_ref_a.png)
Figure 8.4: Argument Passing (a) |
![[Argument Passing (b)]](img/py03/arg_ref_b.png)
Figure 8.5: Argument Passing (b) |
![[Argument Passing (c)]](img/py03/arg_ref_c.png)
Figure 8.6: Argument Passing (c) |
![[Argument Passing (d)]](img/py03/arg_ref_d.png)
Figure 8.7: Argument Passing (d) |
- If you want to pass a copy of a list into
mutate, use mutate(single, triple[:])
triple[:] is the same as triple[0:len(triple)]…- …which is a slice of
triple that includes the entire list… - …and slicing creates a new list
![[Passing a Slice]](img/py03/pass_slice.png)
Figure 8.8: Passing a Slice |
- Comment on this slide
Default Parameter Values
- You can specify default values for parameters when defining a function
- Just “assign” some constant to the parameter in the definition
- The parameters actually passed when the function is called are matched up left to right
- So put the parameters the user is least likely to want to change at the end of the parameter list
def total(values, start=0, end=None):
# If no values given, total is zero.
if not values:
return 0
# If no end specified, use the entire sequence.
if end is None:
end = len(values)
# Calculate.
result = 0
for i in range(start, end):
result += values[i]
return result
numbers being added [10, 20, 30, 40, 50, 60]
total(numbers, 0, 3) 60
total(numbers, 4) 110
total(numbers) 210
- All arguments with defaults must come after all arguments without them
- Otherwise, matching values to parameters would be ambiguous
def total(values, start=0, end):
# If no values given, total is zero.
if not values:
return 0
# Calculate.
result = 0
for i in range(start, end):
result += values[i]
return result
- Comment on this slide
Extra Arguments
- If a function has been defined the right way, you can call it with extra arguments
- Last parameter in definition must have
* in front of its name- Doesn't matter what it's called, but most people use
*extra
- Any “leftover” parameters are put in a tuple and passed via this argument
- Remember, a tuple is an immutable (unmodifiable) list
- Can only be one
* argument per function- If there were two or more, how would Python know which extra values to put in which?
def show(first, *extra):
print first, extra
show(10)
show(10, 20, 30)
show(10, 'text', ['a', 'list'])
10 ()
10 (20, 30)
10 ('text', ['a', 'list'])
- Comment on this slide
Functions Are Objects
- Once it has been translated into instructions, a function is just another object
- Happens to be an object you can call, just as strings and lists happens to be objects you can index
def is just a shorthand for “create a function, and assign it to a variable”Pi = 3.14
def circum(r):
return 2 * Pi * r
for x in [1.0, 2.0, 3.0]:
print x, circum(x)
![[Functions As Objects]](img/py03/function_object.png)
Figure 8.9: Functions As Objects |
- This means you can:
- Redefine functions (just as you can reassign values to variables)
- Create aliases for functions
- Pass functions as parameters
- Store functions in lists
- Example: apply a function to each value in a list
# Apply a function to each value in a list.
def applyToList(func, values):
result = []
for x in values:
y = func(x)
result.append(y)
return result
# A sample function (calculates 1/3 of value).
def oneThird(a):
return a/3.0
# Test.
values = [0.0, 0.5, 0.75, 0.875, 0.9375]
output = applyToList(oneThird, values)
for i in range(len(values)):
print values[i], "=>", output[i]
0.0 => 0.0
0.5 => 0.166666666667
0.75 => 0.25
0.875 => 0.291666666667
0.9375 => 0.3125
- Example: apply several functions to a single value
# Apply each function in a list to a value.
def applyEach(functions, value):
result = []
for f in functions:
y = f(value)
result.append(y)
return result
# One half.
def oneHalf(a):
return a/2.0
# One third.
def oneThird(a):
return a/3.0
# One quarter.
def oneQuarter(a):
return a/4.0
# Test.
functions = [oneHalf, oneThird, oneQuarter]
output = applyEach(functions, 0.25)
for i in range(len(functions)):
print functions[i].__name__, output[i]
oneHalf 0.125
oneThird 0.0833333333333
oneQuarter 0.0625
- Note: every function has an attribute called
__name__, which is the name it was originally defined under # An example function.
def original(left, right):
return (left + right) / 2.0
# Create an alias for it.
alias = original
# Call them directly.
print "calling original:", original.__name__, original(3, 4)
print "calling alias:", alias.__name__, alias(3, 4)
# Get a little fancy.
print "Calling in loop..."
for f in [original, alias]:
print f.__name__, f(5, 8)
calling original: original 3.5
calling alias: original 3.5
Calling in loop...
original 6.5
original 6.5
- Comment on this slide
Creating Modules
- Every Python file is automatically also a module (or library)
- If the file is called
xyz.py, load it using import xyz
- The statements in the module are executed as it is loaded
- Assignment and
def are statements - You can use conditionals, loops, and anything else, too
- Refer to things in a module using
module.thing
- Put this in
mylib.py # A constant.
value = 123
# Remember, modules are executed as they're loaded...
if value < 200:
size = 'small'
else:
size = 'large'
# A function.
def printVersion():
print 'Stuff Version 2.2'
- Use it by putting this in
program.py # Load the contents of mylib.py.
import mylib
# Define our own version of value.
value = "something else entirely"
# Show mylib's values.
print 'mylib.value', mylib.value
print 'mylib.size', mylib.size
# Show out own values.
print 'own value:', value
# Call function.
mylib.printVersion()
- When
program.py runs, it prints this # Load the contents of mylib.py.
import mylib
# Define our own version of value.
value = "something else entirely"
# Show mylib's values.
print 'mylib.value', mylib.value
print 'mylib.size', mylib.size
# Show out own values.
print 'own value:', value
# Call function.
mylib.printVersion()
mylib.value 123
mylib.size small
own value: something else entirely
Stuff Version 2.2
- Notice that both
mylib.py and program.py define value
- Each module is its own global scope
- When something in
mylib.py references value, it always gets its own value
- You can also use:
import mylib as m, then call m.printVersion()from mylib import printVersion, then call printVersion()
- Question: what would happen if you did
from mylib import value?
from mylib import * imports everything from mylib
- Comment on this slide
The Math Library
- Python is a small language with a large library
- Much of the standard library is just Python interfaces to standard C libraries
- Example: the
math library| Type | Name | Purpose | Example | Result |
|---|
| Constant | e | Constant | e | 2.71828… |
| pi | Constant | pi | 3.14159… |
| Function | ceil | Ceiling | ceil(2.5) | 3.0 |
| floor | Floor | floor(-2.5) | -3.0 |
| exp | Exponential | exp(1.0) | 2.71828… |
| log | Logarithm | log(4.0) | 1.38629… |
| | log(4.0, 2.0) | 2.0 |
| log10 | Base-10 logarithm | log10(4.0) | 0.60205… |
| pow | Power | pow(2.5, 2.0) | 6.25 |
| sqrt | Square root | sqrt(9.0) | 3.0 |
| cos | Cosine | cos(pi) | -1.0 |
| asin | Arc sine | asin(-1.0) | -1.5707… |
| hypot | Euclidean norm x2 + y2 | hypot(2, 3) | 3.60555… |
| degrees | Convert from radians to degrees | degrees(pi) | 180 |
| radians | Convert from degrees to radians | radians(45) | 0.78539… |
Table 8.1: The Python Standard Math Library
- All the other trigonometric functions (
tan, arctan, etc.) are also there abs is built in to the language: don't have to import anything to use it
- Comment on this slide
The System Library
- Most commonly used library in Python is the system library
sys
- Information about Python (e.g., version number and copyright notice)
- Information about the environment (e.g., what operating system the program is running on)
- Advanced features that mere mortals should never meddle with
- By convention, always imported with
import sys
- Members referred to as
sys.x, rather than just x
- Contents
| Type | Name | Purpose | Example | Result |
|---|
| Data | argv | The program's command line arguments | argv[0] | "myscript.py" (or whatever your program is called) |
| maxint | Largest positive value that can be represented by Python's basic integer type | sys.maxint | 2147483647 |
| path | List of directories that Python searches when importing modules | sys.path | ['/home/greg/pylib', '/Python24/lib', '/Python24/lib/site-packages'] |
| platform | What type of operating system Python is running on | sys.platform | "win32" |
| stdin | Standard input | sys.stdin.readline() | (Typically) the next line of input from the keyboard |
| stdout | Standard output | sys.stdout.write('****') | (Typically) print four stars to the screen |
| stderr | Standard error | sys.stderr.write('Program crashing!\n') | Print an error message to the screen |
| version | What version of Python this is | sys.version | "2.4 (#60, Feb 9 2005, 19:03:27) [MSC v.1310 32 bit (Intel)]" |
| Function | exit | Exit from Python, returning a status code to the operating system | sys.exit(0) | Terminates program with status 0 |
Table 8.2: The Python System Library
sys.argv is automatically initialized to the program's command-line arguments- Program's name is always
sys.argv[0] - Example:
python program.py -d noisy gets ['program.py', '-d', 'noisy']
sys.path is the list of places Python is allowed to look to find modules for import- Initialized from the
PYTHONPATH environment variable- Directory containing the program being run is automatically put at the start of this list
- If
sys.path is ['/home/swc/exmpl/py03', '/home/swc/lib', '/Python24/lib'], then import quark will look for:
/home/swc/exmpl/py03/quark.py/home/swc/lib/quark.py/Python24/lib/quark.py- then fail
sys.stdin and sys.stdout are normally connected to the keyboard and screen- If you redirect input, or are using a pipe,
sys.stdin reads from a file or another program- In
python reader.py < somefile.txt, reader.py's sys.stdin reads from somefile.txt
- If you redirect output, or are using a pipe,
sys.stdout writes to a file or another programpython writer.py | sort | uniq sorts the output of writer.py and removes duplicate lines
sys.exit terminates the program, and returns an integer status code to the operating system- 0 indicates successful execution (“zero errors”)
- Non-zero is an error code
- Please be polite and return 1 or -1 if you detect an error in your program
- Yes, it's the opposite of what you'd expect…
- If you don't exit explicitly, Python returns 0
- Comment on this slide
Times
- Can't delete old files, or check how long your program has been running, without some notion of time
- Terminology:
- The epoch is the moment from which time is measured. On Unix, the epoch is midnight, January 1, 1970; time is measured in seconds since then.
- Coordinated Universal Time, or UTC, is the official time planet Earth. It used to be called Greenwich Mean Time (GMT).
- Local time is what the clock on your wall shows. It is defined to be UTC, plus or minus a constant offset determined by your time zone, possibly modified by daylight savings time.
- A leap year is a year that has an extra day (to keep the calendar in synch with Earth's orbit around the sun)
- A leap second is an extra second added to a day to keep UTC in synch with Earth's rotation
- Sometimes, two leap seconds have to be added.
- A time structure is an object with nine fields, representing a time in a more comprehensible format.
tm_year: year (e.g., 2005)tm_mon: month (integer in the range [1, 12])tm_mday: day of the month (integer in the range [1, 31])tm_hour: hour of the day (integer in the range [0, 23])tm_min: minute of the hour (integer in the range [0, 59])tm_sec: second of the minute (integer in the range [0, 61])tm_wday: week day (integer in the range [0, 6], with Monday being 0)tm_yday: day of the year (integer in the range [1, 366])tm_isdst: daylight savings time flag
- Contents of
time module (presented out of order so that they'll make more sense):
| Type | Name | Purpose | Example | Result |
|---|
| Function | time | The current time, in seconds since the epoch. | time.time() | 1111941637.0929999 (on March 27, 2005, at 11:40:39 EST) |
| gmtime | Convert a time (in seconds since the epoch) to a time structure representing the time in UTC. | time.gmtime(1111941637) | A structure with s.tm_year equal to 2005, s.tm_hour equal to 16, etc. |
| localtime | Like gmtime, but returns the local time. | time.localtime(1111941637) | A structure with s.tm_hou equal to 11 instead of 16. |
| ctime | Convert a time (in seconds since the epoch) to a readable string in local time. | time.ctime(1111941637) | "Sun Mar 27 11:40:37 2005" |
| asctime | Convert a time structure to a readable string in UTC. | time.asctime(time.gmtime(1111941637)) | "Sun Mar 27 16:40:37 2005" |
| mktime | Convert a time structure (or a 9-element tuple) to seconds since the epoch in the local time zone. | time.mktime((1970, 1, 1, 0, 0, 0, 0, 0, 0)) | 18000.0 (in Eastern Standard Time) |
| sleep | Pause the program for the specified number of seconds. | time.sleep(5) | Nothing, for five seconds. |
Table 8.3: The Python Time Library
- The names and behavior of these functions are taken from the standard C library
- Less consistent than one might want
- For example, no equivalent of
time.mktime for converting to UTC
- Always store and manipulate times in UTC
- Especially important in a world in which the telescope could be in Hawaii, and the database in Paris.
- Do not try to write calendar calculations yourself
- It's a lot harder than you think
- Use the
datetime and calendar modules
- Comment on this slide
Working with the File System
- The
os module contains a lot of functions for interacting with the operating system- The table below is only a fraction of them
| Type | Name | Purpose | Example | Result |
|---|
| Constant | curdir | The symbolic name for the current directory. | os.curdir | . on Linux. |
| pardir | The symbolic name for the parent directory. | os.pardir | .. on Linux. |
| sep | The separator character used in paths. | os.sep | / on Linux, \ on Windows. |
| linesep | The end-of-line marker used in text files. | os.linesep | \n on Linux, \r\n on Windows. |
| Function | listdir | List the contents of a directory. | os.listdir('/tmp') | The names of all the files and directories in /tmp (except . and ..). |
| mkdir | Create a new directory. | os.mkdir('/tmp/scratch') | Make the directory /tmp/scratch. Use os.makedirs to make several directories at once. |
| remove | Delete a file. | os.remove('/tmp/workingfile.txt') | Delete the file /tmp/workingfile.txt. |
| rename | Rename (or move) a file or directory. | os.rename('/tmp/scratch.txt', '/home/swc/data/important.txt') | Move the file /tmp/scratch.txt to /home/swc/data/important.txt. |
| rmdir | Remove a directory. | os.rmdir('/home/swc') | Probably not something you want to do… Use os.removedirs to remove several directories at once. |
| stat | Get information about a file or directory. | os.stat('/home/swc/data/important.txt') | Find out when important.txt was created, how large it is, etc. |
Table 8.4: The Python File System Library
- Note:
os.stat returns an object with at least ten members, the most important of which are:
st_size: size in bytesst_atime: time of most recent accessst_mtime: time of most recent modification- These times are in seconds since the epoch
import time, os
info = os.stat('stat_file.py')
print 'size:', info.st_size
print 'access time:', time.ctime(info.st_atime)
print 'modification time:', time.ctime(info.st_mtime)
size: 177
access time: Sat Jul 2 07:16:46 2005
modification time: Mon Jun 13 09:31:10 2005
- Comment on this slide
Manipulating Pathnames
- The
os module has a submodule called os.path
- Manipulate pathnames (filenames and directory names) correctly and efficiently
- Do not write your own functions for this—the rules are trickier than you think
- Used at least as often as
os itself
- Contents:
| Type | Name | Purpose | Example | Result |
|---|
| Function | abspath | Create normalized absolute pathnames. | os.path.abspath('../jeevan/bin/script.py') | /home/jeevan/bin/script.py (if executed in /home/gvwilson) |
| basename | Return the last portion of a path (i.e., the filename, or the last directory name). | os.path.basename('/tmp/scratch/junk.data') | junk.data |
| dirname | Return all but the last portion of a path. | os.path.basename('/tmp/scratch/junk.data') | /tmp/scratch |
| exists | Return True if a pathname refers to an existing file or directory. | os.path.exists('./scribble.txt') | True if there is a file called scribble.txt in the current working directory, False otherwise. |
| getatime | Get the last access time of a file or directory (like os.stat). | os.path.getatime('.') | 1112109573 (which means that the current directory was last read or written at 10:19:33 EST on March 29, 2005). |
| getmtime | Get the last modification time of a file or directory (like os.stat). | os.path.getmtime('.') | 1112109502 (which means that the current directory was last modified 71 seconds before the time shown above). |
| getsize | Get the size of something in bytes (like os.stat). | os.path.getsize('py03.swc') | 29662. |
| isabs | True if its argument is an absolute pathname. | os.path.isabs('tmp/data.txt') | False |
| isfile | True if its argument identifies an existing file. | os.path.isfile('tmp/data.txt') | True if a file called ./tmp/data.txt exists, and False otherwise. |
| isdir | True if its argument identifies an existing directory.. | os.path.isdir('tmp') | True if the current directory has a subdirectory called tmp. |
| join | Join pathname fragments to create a full pathname. | os.path.join('/tmp', 'scratch', 'data.txt') | "/tmp/scratch/data.txt" |
| normpath | Normalize a pathname (i.e., remove redundant slashes, uses of . and .., etc.). | os.path.normpath('tmp/scratch/../other/file.txt') | "tmp/other/file.txt" |
| split | Return both of the values returned by os.path.dirname and os.path.basename. | os.path.split('/tmp/scratch.dat') | ('/tmp', 'scratch.dat') |
| splitext | Split a path into two pieces root and ext, such that ext is the last piece beginning with a ".". | os.path.splitext('/tmp/scratch.dat') | ('/tmp/scratch', '.dat') |
Table 8.5: Python's Pathname Library
- Comment on this slide
Knowing Where You Are
- Python assigns the module's name to the special variable
__name__
"__main__" when the file is run from the command line- The module's name when it is loaded by something else
- Often used to draw attention to the main body of a large program
- Or to include self-tests in module
- If the module is loaded by something else, its name isn't
"__main__", so the tests aren't run
- Comment on this slide
Where to Learn More
Exercises
Exercise 8.1:
Write a function that takes two strings called text and
fragment as arguments, and returns the number of times
fragment appears in the second half of text. Your
function must not create a copy of the second half of
text. (Hint: read the documentation for string.count.)
Exercise 8.2:
What does the Python keyword global do?
What are some reasons not to write code that uses it?
Exercise 8.3:
Consider the following sample of code and its output:
def settings(first, **rest):
print 'first is', first
print 'rest is'
for (name, value) in rest.items():
print '...', name, value
print
settings(1)
settings(1, two=2, three="THREE")
first is 1
rest is
first is 1
rest is
... two 2
... three THREE
What does the variable rest do? What does the double
asterisk ** in front of its name mean? How does it compare
to the example with *extra (with a single asterisk) in the
lecture?
Exercise 8.4:
Python allows you to import all the functions and variables in a
module at once, making them local name. For example, if the
module is called values, and contains a variable called
Threshold and a function called limit, then after
the statement from values import *, you can then refer
directly to Threshold and limit, rather than having
to use values.Threshold or values.limit. Explain
why this is generally considered a bad thing to do, even though it
reduces the amount programmers have to type.
Exercise 8.5:
sys.stdin, sys.stdout, and sys.stderr are
variables, which means that you can assign to them. For example,
if you want to change where print sends its output, you can
do this:
import sys
print 'this goes to stdout'
temp = sys.stdout
sys.stdout = open('temporary.txt', 'w')
print 'this goes to temporary.txt'
sys.stdout = temp
Do you think this is a good programming practice? When and why
do you think its use might be justified?
Exercise 8.6:
os.stat(path) returns an object whose members describe
various properties of the file or directory identified by
path. Using this, write a function that will determine
whether or not a file is more than one year old.
Exercise 8.7:
Write a Python program that takes as its arguments two years (such
as 1997 and 2007), prints out the number of days between the 15th
of each month from January of the first year until December of the
last year.
Exercise 8.8:
Write a simple version of which in Python.
Your program should check each directory on the caller's path
(in order) to find an executable program that has the name given
to it on the command line.
Testing Basics
Motivation
- Testing software is just as important as calibrating experimental equipment, or checking the purity of reagants
- But it's done far less often
- Reason #1: people don't know how
- Reason #2: low standards can perpetuate themselves until someone decides to make quality an issue
- American cars in the 1970s
- Study after study has proved that the more time you spend testing, the less total time it takes to develop working software
- So long as you know how to test systematically…
- …and use the right tools to help you…
- …and design your software to be testable
- Studies also show that how well a team tests is a good predictor of how good they are at meeting schedule and quality targets
- Because discipline matters more than genius
- Comment on this slide
Terminology
- Unit testing (or component testing): testing one component in isolation
- E.g., test that all of a class's methods do what they're supposed to
- System testing (or integration testing): testing that components work properly in combination
- E.g., make sure that when X calls Y, it passes parameters in the right order
- These are two ends of a spectrum
- Most classes rely on several others, and so cannot be tested in isolation
- Often pursue a bottom-up testing strategy
- Test the simple classes first, then the ones that use them, then the ones that use those
- Often makes sense to test subsystems (made up of several classes), rather than the entire program
- E.g., test the voice recognition module
- Sometimes have to build scaffolding to support testing
- Like the scaffolding on a building site, it's extra code that is used during the construction process, but not delivered to customers
- Regression testing: rerunning tests periodically to see whether the code has regressed
- I.e., whether things that used to work are now broken
- Every time a change is made to the system, the programmer making the change should rerun the regression tests before checking her changes into version control
- Never slow other people down by breaking things that they're relying on
- A large program that doesn't have regression tests is difficult (sometimes impossible) to maintain
- The only way to know whether a change to one part will break something somewhere else is to run the whole program
- A fixture is a thing to be tested
- E.g., an object, a data structure, a complete database
- A test is something you do to a fixture
- A test result can be one of:
- Pass: the actual outcome is what was expected
- Fail: the actual outcome is different from what was expected
- Error: something went wrong with the test itself (i.e., the test contains a bug)
- Don't know anything about the system being tested
- Comment on this slide
Example: Rectangle Overlap
- Want to test a function
overlap that calculates the overlap between two rectangles- Each rectangle is represented by an instance of the class
Rect
- Constructor takes four non-negative integer arguments (low and high XY coordinates)
- Class has methods
getLoX, getLoY, getHiX, and getHiY - Rectangle is guaranteed to have non-zero area
- Function returns:
- A new rectangle representing the overlap of its inputs if they overlap
None if there's no overlap- Note: since rectangles have to have non-zero area, touching on an edge or corner does not count as overlap
- Analysis
- This is unit testing (one thing in isolation)
- Assume for the moment that
Rect is correct - I.e., that it has been tested elsewhere
- Each fixture will be a pair of rectangles
- The test will be to pass them to
overlap, and see if the output is correct
- First version of
overlap:
def overlap(left, right):
'''Calculate overlap between two rectangles.'''
loX = max(left.getLoX(), right.getLoX())
hiX = min(left.getHiX(), right.getHiX())
loY = max(left.getLoY(), right.getLoY())
hiY = min(left.getHiY(), right.getHiY())
if (loX >= hiX) or (loY > hiY):
return None
return Rect(loX, loY, loX, hiY)
- How many errors can you spot?
- First test passes:
r1 = Rect(0, 0, 1, 1)
r2 = Rect(2, 2, 3, 3)
result = overlap(r1, r2)
if result == None:
print 'pass'
else:
print 'fail'
pass
- But not the second one:
r1 = Rect(0, 0, 1, 1)
r2 = Rect(0, 1, 3, 3)
result = overlap(r1, r2)
if result == None:
print 'pass'
else:
print 'fail'
Traceback (most recent call last):
File "overlap.py", line 97, in ?
result = overlap(r1, r2)
File "overlap.py", line 64, in overlap
return Rect(loX, loY, hiX, hiY)
File "overlap.py", line 9, in __init__
assert 0 <= loY < hiY, 'illegal Y coordinates'
AssertionError: illegal Y coordinates
- This is an error in the test
- A good test will always run, no matter what state the code is in
- Revise the test and rerun:
try:
r1 = Rect(0, 0, 1, 1)
r2 = Rect(0, 1, 3, 3)
result = overlap(r1, r2)
if result == None:
print 'pass'
else:
print 'fail'
except Exception, e:
print 'error', e
pass
- Why is this change important?
- Need to be able to run all tests, all the time
- If the first test raises an exception that isn't handled, the other ninety-nine tests won't run at all
- So, what's the problem?
- Passing
loX to Rect's constructor twice, rather than loX and hiX - Fix and rerun tests
- Testing
loY > hiY instead of loY >= hiY - Fix and rerun tests
- We ought to test that
overlap works when rectangles do overlap try:
r1 = Rect(0, 0, 3, 3)
r2 = Rect(1, 1, 2, 2)
result = overlap(r1, r2)
if result == Rect(1, 1, 2, 2):
print 'pass'
else:
print 'fail'
except Exception, e:
print 'error', e
error name 'overlap' is not defined
- This is only one case out of many
![[Possible Overlap Tests]](img/test/possible_overlaps.png)
Figure 9.1: Possible Overlap Tests |
- How many tests should we write? When should we stop?
- Comment on this slide
General Rules for Unit Tests
- Make tests independent of each other
- No interaction between tests
- Nothing one test does can affect any other test
- In particular, the order in which tests are run shouldn't matter
- If two tests depend on one another, then when the first fails, the second either:
- Isn't run (so you have less information), or
- Runs, but gives you the wrong answer (which is even worse than no information)
- Keep tests small
- Each tests exactly one thing
- When a test fails, you'll know the cause right away
- Don't hard-code absolute path names, magic numbers, etc.
- Want to be able to move tests from machine to machine
- Have to b able to maintain tests as code evolves
- Test code should be at least as well written as production code
- Comment on this slide
A Simple Testing Framework
- Step back a moment and think about how we're writing our tests
- Each has two rectangles as input, and expects either a rectangle or
None - Each reports passes, fails, or errors
- Design principle: every time you see a pattern, try to capture it in code
- Write the shared code once
- Encourage people to do things the right way by making that way easiest
- Step 1: write a function that runs a test and reports its result
def runTest(left, right, expected):
try:
assert left is not None
assert right is not None
actual = overlap(left, right)
if actual == expected:
return 'pass'
else:
return 'fail'
except Exception, e:
return 'error'
- Just as easy to return strings as to define constants for pass, fail, and error
- Note: assertions on input arguments are inside the
try/except
- Passing
None as the left or right rectangle is an error in the test
- Step 2: write some tests
- Whenever possible, create a table of inputs and results
- Easier for people to read, understand, and add to than code
- Especially if those people aren't programmers
def runTest(left, right, expected):
try:
assert left is not None
assert right is not None
actual = overlap(left, right)
if actual == expected:
return 'pass'
else:
return 'fail'
except Exception, e:
return 'error'
- We can make this even more flexible by writing a function that tests an arbitrary function
- Remember that
def f(*args) creates a function that takes any number of arguments - If you have a tuple of values called
vals, then f(*vals) matches the values with f's arguments- E.g., if
vals is (11, 12, 13), then f(*vals) is the same as f(11, 12, 13) - Gives you a way to call an arbitrary function with an arbitrary set of arguments in a uniform way
- Use this to write a function
runAllTests that takes two arguments:
- The function you actually want to test
- A list of lists, each of which consists of zero or more arguments, and the expected result
def runAllTests(func, tests):
pass = fail = error = 0
for testCase in tests:
try:
args, expected = testCase[:-1], testCase[-1]
actual = func(*args)
if actual == expected:
pass += 1
else:
fail += 1
except Exception, e:
error += 1
for (label, count) in (('pass', pass), ('fail', fail), ('error', error)):
print '%s: %4d' % (label, count)
runAllTests(overlap, tests)
- Advantage: reusability
- Can use this to test any function
- Disadvantage: obscurity
- The extra level of abstraction makes it harder to understand what's going on
- But you only have to understand one hard thing once, instead of many slightly less hard things over and over
- Take a look in the next lecture at a full-fledged unit testing framework
- Comment on this slide
Choosing Test Cases
- You can't test everything
- There are roughly 10 million 7-digit phone numbers
- A function that puts two phone numbers in order has 1014 possible inputs
- At a million tests a second, it will take 108 seconds (or roughly three years) to check them all
- And then we move on to the second function…
- Faster computers won't get us out of this hole
- How many possible inputs are there to a function that sorts a list of strings?
- Human beings are creatures of habit
- Tend to make the same kinds of errors over and over again
- So test for those first
- Has an interesting side effect: once you start testing for habitual errors, programmers become more conscious of them, and make them less often
- The first tests you should write:
- Tests you expect to succeed
- Boundary cases (e.g., sort the empty list, or a list of one value)
- Simplest interesting case (e.g., sort a list of two values)
- General case (e.g., sort a list of nine values)
- If duplicate values are allowed, make sure you test with them
- Tests you expect to fail
- Invalid input (e.g., passed a dictionary instead of a list)
- Remember, error handling is part of the interface too
- Sanity tests
- Make sure data structures remain consistent
- If there is redundant information, check it against itself
- A catalog of errors
- Numbers: zero, largest, smallest magnitude, most negative
- Structures: empty, exactly one element, maximum number of elements
- Duplicate elements (e.g., the letter
"J" appears three times in a string) - Aliased elements (e.g., a list contains two references to another list)
- Circular structures (e.g., a list that contains a reference to itself)
- Searching: no match found, one match found, multiple matches found, everything matches
- Code like
x = findAll(structure)[0] is almost always wrong - Should also check aliased matches (same thing found multiple times)
- Comment on this slide
Dictionaries and Error Handling
Motivation
- Problem: you want to count how often different people send messages to a mailing list
- Input: a file with one email address per line
- Each address may appear many times
- Output: a table of addresses, sorted by frequency of appearance
- Solution: keep track of counts in a list of pairs
- Each entry in the list is
[address, countSoFar] - Each time an address is read in, search the list, and either:
- Increment the count of an existing entry, or
- Create a new entry with a count of 1
- Once everything has been read, sort the entries
- Note: you can pass your own comparison function to
list.sort
import sys
def countFind(counts, address):
for x in counts:
if x[0] == address:
return x
return None
def countFreq(lines):
counts = []
for line in lines:
line = line.strip()
existing = countFind(counts, line)
if existing:
existing[1] += 1
else:
counts.append([line, 1])
return counts
def compareByFrequency(a, b):
if a[1] < b[1]:
return 1
if a[1] == b[1]:
return 0
return -1
if __name__ == '__main__':
instream = open(sys.argv[1], 'r')
lines = instream.readlines()
instream.close()
result = countFreq(lines)
result.sort(compareByFrequency)
for address, frequency in result:
print frequency, ":", address
- This is clumsy and inefficient
- Clumsy: writing a lot of code to do something that ought to be simple
- Inefficient: repeatedly searching the entire list
- There's an easier way…
- Comment on this slide
String Formatting
- But before we start, let's tidy up a few loose ends
- Python's
% operator is used to format strings- Left side is a string specifying how things are to be formatted
- Right side is the values to be formatted
- One value on its own…
- …or a tuple if there are many values
'here %s go' % 'we' creates a new string "here we go"
- The
"%s" in the left string means “insert a string here” - So the string on the right is inserted
- Since strings are immutable, this creates a new string (does not modify the format string in place)
'left %d right %d' % (-1, 1) creates "left -1 right 1"
"%d" stands for “decimal integer”- Format specifiers are matched with values left to right
'do %s %d times' % ('this', 99) creates "do this 99 times"
'[%4d]' % 13 creates "[ 13]"
- The integer between the
"%" and "d" specifies a width '[%-4d]' % 13 creates "[13 ]", because negative widths mean “left justify”
'[%6.2f %%]' % 37.2 creates "[ 37.20 %]"
"%6.2f" means “a floating point number, six characters wide, with two after the decimal point”"%%" is translated into a single "%" character- Just as
"\\" is how you represent a single "\" in a string
'[%6.2e %%]' % 37.2 creates "[3.72e+001 %]"
"%e" is scientific (exponential) notation- Python will use more characters than it's been told to if it has to
- Comment on this slide
Dictionaries
- A sequence (such as a string or list) associates one value with each integer from 0 to N-1
- A dictionary associates one value with each of its keys
- Also called maps, hashes, and associative arrays
- Keys don't have to be integers
- In fact, they are usually strings
- Often visualized as a two-column table
- Each key in the left column must be unique
- Values can occur several times
- Example planet names as keys, and the lengths of their years as values
| Mercury | 87.97 |
| Venus | 224.70 |
| ⋮ | ⋮ |
| Pluto | 90550.0 |
- Comment on this slide
The Mechanics
- Create a dictionary by putting key/value pairs inside
{}
{'Newton':1642, 'Darwin':1809}- Empty dictionary written
{} - Get a value from an existing dictionary using
[] (just like everything else in Python) birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
print "Darwin's birthday:", birthday['Darwin']
print "Newton's birthday:", birthday['Newton']
Darwin's birthday: 1809
Newton's birthday: 1642
- Assigning to a dictionary key:
- Creates a new entry if the key is not already in dictionary
- Overwrites the previous value if the key is already present
birthday = {}
birthday['Darwin'] = 1809
birthday['Newton'] = 1942 # oops
birthday['Newton'] = 1642
print birthday
{'Darwin': 1809, 'Newton': 1642}
- Can only access keys that are present
- Just as you can't index elements of a list that aren't there
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
print birthday['Turing']
Traceback (most recent call last):
File "key_error.py", line 5, in ?
print birthday['Turing']
KeyError: 'Turing'
- Remove an entry using
del d[k]
- Can only remove entries that are actually present
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
print 'Before deleting Turing:', birthday
del birthday['Turing']
print 'After deleting Turing:', birthday
Before deleting Turing: {'Turing': 1912, 'Newton': 1642, 'Darwin': 1809}
After deleting Turing: {'Newton': 1642, 'Darwin': 1809}
for k in d loops over the dictionary's keys (rather than its values)
- Note: this is different from lists, where
for loops over the values, rather than the “keys” (indices) birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
for name in birthday:
print name, birthday[name]
Turing 1912
Newton 1642
Darwin 1809
- Test whether a key
k is in a dictionary d using k in d
- Once again, inconsistent with behavior of lists, but useful
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
for name in ['Newton', 'Turing']:
if name in birthday:
print name, birthday[name]
else:
print 'Who is', name, '?'
Newton 1642
Who is Turing ?
- Comment on this slide
Dictionary Methods
- Yes, dictionaries are objects too…
| Method | Purpose | Example | Result |
|---|
clear | Empty the dictionary. | d.clear() | Returns None, but d is now empty. |
get | Return the value associated with a key, or a default value if the key is not present. | d.get('x', 99) | Returns d['x'] if "x" is in d, or 99 if it is not. |
keys | Return the dictionary's keys as a list. Entries are guaranteed to be unique. | birthday.keys() | ['Turing', 'Newton', 'Darwin'] |
items | Return a list of (key, value) pairs. | birthday.items() | [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)] |
values | Return the dictionary's values as a list. Entries may or may not be unique. | birthday.values() | [1912, 1642, 1809] |
update | Copy keys and values from one dictionary into another. | See the example below. | |
Table 10.1: Dictionary Methods
- Example:
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
print 'keys:', birthday.keys()
print 'values:', birthday.values()
print 'items:', birthday.items()
print 'get:', birthday.get('Curie', 1867)
temp = {
'Curie' : 1867,
'Hopper' : 1906,
'Franklin' : 1920
}
birthday.update(temp)
print 'after update:', birthday
birthday.clear()
print 'after clear:', birthday
keys: ['Turing', 'Newton', 'Darwin']
values: [1912, 1642, 1809]
items: [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)]
get: 1867
after update: {'Curie': 1867, 'Darwin': 1809, 'Franklin': 1920, 'Turing': 1912, 'Newton': 1642, 'Hopper': 1906}
after clear: {}
- Comment on this slide
Counting Frequency
- Revisit the problem of counting how often things appear in a file
- Use the things (email addresses) as keys in a dictionary
- The value associated with each key is the number of times it has been seen so far
# Data to count.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
# Build a dictionary of frequencies.
freq = {}
for name in names:
# Already seen, so increment count by one.
if name in freq:
freq[name] = freq[name] + 1
# Never seen before, so add to dictionary.
else:
freq[name] = 1
# Display.
print freq
{'Be': 3, 'Mg': 3, 'Ca': 1}
![[Counting Frequency]](img/py04/freq_count.png)
Figure 10.1: Counting Frequency |
- Note: dictionaries do not store keys in order
- No way to predict where a dictionary will insert a new value
- Paradoxically, this is part of what makes dictionaries so fast
- So, to print frequency in alphabetic order by address:
- Get the list of keys
- Sort that list
- Loop over it
# Build the frequency dictionary as before.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
if name in freq:
freq[name] = freq[name] + 1
else:
freq[name] = 1
# Print in alphabetical order by key.
keys = freq.keys()
keys.sort()
for k in keys:
print k, freq[k]
Be 3
Ca 1
Mg 3
- Can simplify this code using
dict.get
- Get either the count associated with the key, or 0, then add one to it
# Use 'get' to simplify the counting loop.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
freq[name] = freq.get(name, 0) + 1
# Print.
keys = freq.keys()
keys.sort()
for k in keys:
print k, freq[k]
Be 3
Ca 1
Mg 3
- But how to print in order of frequency?
- Need to invert the dictionary (i.e., swap keys and values)
- Problem: there might be collisions, since values aren't guaranteed to be unique
- What is the inverse of
{'a':1, 'b':1, 'c':1}?
- Solution: store a list of values instead of just a single value
# Count values as before.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
freq[name] = freq.get(name, 0) + 1
# Invert.
inverse = {}
for (key, value) in freq.items():
seen = inverse.get(value, [])
seen.append(key)
inverse[value] = seen
# Print.
keys = inverse.keys()
keys.sort()
for k in keys:
print k, inverse[k]
1 ['Ca']
3 ['Be', 'Mg']
![[Tracing Execution of Dictionary Inversion]](img/py04/dict_count_inverse_trace.png)
Figure 10.2: Tracing Execution of Dictionary Inversion |
- Comment on this slide
Formatting Strings with Dictionaries
- Complex string formatting can be hard to understand
- Especially if one value needs to be used several times
- E.g., producing an email form letter, and need to put the person's name in three places
- Instead of a tuple,
"%" can take a dictionary as its right argument- Use
"%(varname)s" or "%(varname)4d" inside format string to identify what's to be substituted '%(word)s, t%(word)s, and everyw%(word)s' % {'word' : 'here'} creates "here, there, and everywhere"
- Comment on this slide
Catching Errors
- Like other modern languages, Python uses exceptions to handle errors
- Separating normal (expected) operation from exceptional cases makes code easier to read
- Structured like
if/else
- Introduce a block of statements with
try - Then introduce error handling code with
except - When something goes wrong in the
try block, Python raises an exception - Python finds the matching
except block, and executes whatever instructions it contains print 'before try/except'
try:
print '..first line of try block'
print 1/0
print '..last line of try block'
except:
print '..handling error!'
print 'after try/except'
before try/except
..first line of try block
..handling error!
after try/except
![[Flow of Control in Try/Except]](img/py04/simple_try_except.png)
Figure 10.3: Flow of Control in Try/Except |
- You can include code to execute when there isn't an error using
else
for val in (0.0, 1.0):
print "val is", val
try:
x = 1.0/val
except:
print '..handling error!'
else:
print '...in the else block'
val is 0.0
..handling error!
val is 1.0
...in the else block
- Comment on this slide
Exception Objects
- When an error occurs, Python actually creates an object to store information about it
- Typically an error message
- Can choose to handle specific errors by specifying an exception type in the
except statement- E.g., handle division by zero, but not out-of-bounds list index
- Syntax is
except ExceptionType, variable
ExceptionType is what you want to catch- If something of that type is raised, the exception object is assigned to
variable
values = [-1.0, 0.0, 1.0]
for i in range(4): # note: top index will be out of bounds
try:
x = 1.0 / values[i]
except ZeroDivisionError, e:
print 'divide by zero:', e
divide by zero: float division
Traceback (most recent call last):
File "except_with_type.py", line 4, in ?
x = 1.0 / values[i]
IndexError: list index out of range
- You can associate any number of
except blocks with a single try
- They're tested in order—whichever matches first, wins
- If a “naked”
except appears, it must be the last one (since it catches everything)
- Better to use
except Exception, e so that you have the exception object
- Exceptions are organized in a hierarchy
- E.g.,
ZeroDivisionError, OverflowError, and FloatingPointError are all specialized versions of ArithmeticError - And
IndexError (list/string index out of bounds) and KeyError (non-existent dictionary key) are specializations of LookupError for i in range(2):
try:
if i == 0:
vals = ['a', 'b']
x = vals[99]
else:
vals = {20 : 'a', 30 : 'b'}
x = vals[40]
except KeyError, e:
print 'loop %d, handling key errors:' % i, e
except LookupError, e:
print 'loop %d, handling generic lookup errors:' % i, e
loop 0, handling generic lookup errors: list index out of range
loop 1, handling key errors: 40
- We'll see in a later lecture how this hierarchy is implemented
- Hint: it has something to do with objects…
- Common exception types
| Name | Purpose |
|---|
Exception | Root of exception hierarchy. |
ArithmeticError | Illegal arithmetic operation |
IndexError | Bad index to sequence (out of bounds or illegal type) |
KeyError | Nonexistent index to dictionary |
TypeError | Illegal type (e.g., trying to add integer and string) |
ValueError | Illegal value (e.g., math.sqrt(-1)) |
IOError | Unable to create or open file, read data, etc. |
Table 10.2: Common Exception Types
- Comment on this slide
Functions and Exceptions
- So if a function raises an exception, who gets to handle it?
- Each time Python enters a
try/except block, it pushes the except handlers on a stack- Just like the function call stack
- When an exception is raised, Python look in this stack, and jumps to the first matching handler
def invert(vals, index):
try:
vals[index] = 1/vals[index]
except ArithmeticError, e:
print 'inner exception handler:', e
def each(func, vals, indices):
try:
for i in indices:
func(vals, i)
except IndexError, e:
print 'outer exception handler:', e
# Note: top index will be out of bounds
allValues = [-1, 0, 1]
allIndices = [0, 1, 2, 3]
each(invert, allValues, allIndices)
calling func for index 0
calling func for index 1
inner exception handler: integer division or modulo by zero
calling func for index 2
calling func for index 3
outer exception handler: list index out of range
![[Stacking Exception Handlers]](img/py04/exception_stack.png)
Figure 10.4: Stacking Exception Handlers |
- Comment on this slide
Raising Exceptions
- Use
raise to trigger exception processing- Obviously, have to specify the type of exception you're raising
- E.g.,
raise Exception('this is an error message')
- Please make your error messages more informative…
for i in range(4):
try:
# Raise exception if value is odd.
if (i % 2) == 1:
raise ValueError('index is odd')
else:
print 'not raising exception for %d' % i
except ValueError, e:
print 'caught exception for %d' % i, e
not raising exception for 0
caught exception for 1 index is odd
not raising exception for 2
caught exception for 3 index is odd
- Raise exceptions to report errors in functions
- Rather than returning
None or -1 or something like that list.find (which returns -1 if something can't be found) is unusual in this respect- But it's more convenient than
list.index, which throws an exception
- General rule: throw low, catch high
- I.e., throw lots of very specific exceptions, but only catch them in a few places, high in your code, where you can take corrective action
- Reason #1: every application handles errors differently
- If someone is using your library in a GUI, you don't want to be printing to
stderr
- Reason #2: error handling code gets in the way of understanding the algorithm, so disentangle the two
import sys
# Look for the first matching line in a file.
def findFirstMatchingLine(filename, word):
infile = open(filename, 'r')
lines = infile.readlines()
infile.close()
for line in lines:
if line.find(word) >= 0:
return line.rstrip()
errMsg = '%s does not contain %s' % (filename, word)
raise ValueError(errMsg)
if __name__ == '__main__':
try:
word = sys.argv[1]
for filename in sys.argv[2:]:
line = findFirstMatchingLine(filename, word)
print '%s / %s => %s' % (filename, word, line)
except ValueError, e:
print >> sys.stderr, e
except IOError, e:
print >> sys.stderr, e
except Exception, e:
print >> sys.stderr, e
import sys
# Look for the first matching line in a file.
def findFirstMatchingLine(filename, word):
infile = open(filename, 'r')
lines = infile.readlines()
infile.close()
for line in lines:
if line.find(word) >= 0:
return line.rstrip()
errMsg = '%s does not contain %s' % (filename, word)
raise ValueError(errMsg)
if __name__ == '__main__':
try:
word = sys.argv[1]
for filename in sys.argv[2:]:
line = findFirstMatchingLine(filename, word)
print '%s / %s => %s' % (filename, word, line)
except ValueError, e:
print >> sys.stderr, e
except IOError, e:
print >> sys.stderr, e
except Exception, e:
print >> sys.stderr, e
- Comment on this slide
Assertions
- An assertion is a claim about the state of a program that must be true if the program is correct
- E.g., “this list is not empty”, or “this value is greater than that one”
- Three kinds of assertions:
- A pre-condition is something that must be true in order for something else to work
- E.g., “argument is non-negative” is a pre-condition for
math.sqrt
- A post-condition is something that is guaranteed to be true after something else happens
- E.g., “result is either -1 or a legal index to the string” is a post-condition of
string.find
- Anything else is simply called a state assertion
- E.g., “this loop always executes at least once”
- Embed assertions in programs using the
assert statement- First argument is a Boolean condition
- Second (optional) is an error message
- If the condition isn't satisfied, Python raises an
AssertionError exception def findRange(values):
assert (values != None) and (len(values) > 0), 'Illegal input'
left = min(values)
right = max(values)
assert left < right, 'Empty range'
return (left, right)
- Good programmers practice defensive programming by putting lots of assertions in their code
- “Executable documentation”
- Describes how the program is supposed to behave, in a way that the computer can check
- Can't ever be out of step with the actual code, since it is code
- Fail early, fail often
- The less time that elapses between when an error occurs, and when its effects show up, the easier it is to fix
- Bugs grow up to be assertions
- If you made an error once, there's a good chance you'll make it again
- Adding an assertion helps you find and fix it faster the second time
- Assertions should only be used to check the program's correctness, not the correctness of its inputs
- I.e., don't use
assert to check that your program managed to open a file - It isn't the program's fault if it doesn't have read permission…
- Complex assertions are more likely to contain bugs than to catch them
- If it won't fit comfortably on one line, it probably shouldn't be an assertion
- Comment on this slide
Running Other Programs
- One program can run another by spawning a new process
- Make and the shell both do this
- So can your programs
os.popen runs a command, and lets you either:
- Write to its standard input, or
- Read from its standard output
![[Running Another Program]](img/py04/os_popen.png)
Figure 10.5: Running Another Program |
- Example: read the output of the
ls command into your programimport os
instream = os.popen('ls', 'r')
lines = instream.readlines()
instream.close()
for line in lines[:4]:
line = line.rstrip()
print '%4d: %s' % (len(line), line)
13: addresses.txt
13: assertions.py
14: create_dict.py
18: create_dict.py.out
- Example: send the output of your program to
gzip for compression:
import os
# Create compressed output and report its size.
outstream = os.popen('gzip -c > temp.txt.gz', 'w')
for i in range(1000):
for word in 'this is a test'.split():
print >> outstream, word
outstream.close()
print 'output is %d bytes' % os.stat('temp.txt.gz').st_size
# See how big it would have been.
instream = os.popen('gunzip -c temp.txt.gz', 'r')
total = 0
for line in instream.readlines():
total += len(line)
instream.close()
print 'output would have been %d bytes' % total
output is 79 bytes
output would have been 15000 bytes
- You can use
os.popen2 to connect to the other program's input and output simultaneouslyimport os
childIn, childOut = os.popen2('sort')
for word in 'this is a test'.split():
print >> childIn, word
childIn.close()
for line in childOut.readlines():
print line.rstrip()
childOut.close()
a
is
test
this
- But you must be very careful about deadlock
- If the child is trying to write to the parent while the parent is trying to write to the child, both are blocked
- The operating system can only hold on to a limited amount of input or output data at a time, so increasing the buffer size only delays the problem
![[Deadlock]](img/py04/deadlock.png)
Figure 10.6: Deadlock |
- Things are even trickier if you use
os.popen3, which gives you the child process's stdin, stdout, and stderr
- If the parent is trying to read from the child's
stdout while it's trying to write an error message to stderr, both programs will be blocked
- Solutions are outside the scope of this course
- Use
os.popen (with temporary files if necessary) until you're comfortable enough to move on
- Comment on this slide
Exercises
Exercise 10.1:
Suppose you wanted to sort entries with the same frequency
alphabetically. What changes would you have to make to
compareByFrequency?
Debugging
What It Is
- Finding and fixing flaws in software
- Assume for now that you built the right thing the wrong way
- Requirements errors are actually a major cause of software project failure
- You're going to spend half your professional life debugging
- After all, once it's fixed, you move on to something else
- So learn how to do it well
- Talk about tools first (because they'll make everything else less painful)
- Then some simple rules
- Remember: never debug standing up
- The faster you try to work, the slower you'll go
- Comment on this slide
What's Wrong with Print Statements
- Many people still debug by adding print statements to their programs
- “Gee, I wonder what the value of
maxTemp is when I call resolve?”
- Debugging with print statements is:
- Error-prone: adding print statements is a good way to add typos
- Particularly when you have to modify the block structure of your program to fit it in
- Time-consuming
- All that typing…
- And (if you're using Java, C++, or Fortran) all that recompiling…
- Misleading
- Adding print statements moves things around in memory, changes execution timing, etc.
- Common to have bugs that hide when print statements are added, and reappear when they're removed
- There's a much better way
- Comment on this slide
Symbolic Debuggers
- A symbolic debugger is a program that runs another program on your behalf
- Called “symbolic” because it can show you the source code you wrote, rather than the raw machine code that's executing
- Using a debugger is one of the two things that separates professionals from amateurs
- Using version control is the other
- While the target program (or debuggee) is running, the debugger can:
- Pause, resume, or restart the target
- Inspect and display the values of variables
- Change those values
- Watch for certain events
- Do not need to modify the source of the target program!
- Depending on your language, you may need to compile it with different flags
- And yes, the debugger modifies the target's layout in memory, and execution speed…
- …but a lot less than print statements…
- …with a lot less effort from you
- Comment on this slide
Running in a Debugger
- Depending on language and tools, there may be several ways to get into the debugger
- Launch the debugger, load the target program, and start work
- Run the debugger with the target program as a command-line argument
- In interpreted languages like Python, you can also switch into debugging mode in the middle of an interactive session
- Sometimes also do post mortem debugging
- When a program fails badly (e.g., illegal memory access in C++), it creates a core dump
- Copies all of its internal state to a file on disk
- Can then load that dump into the debugger, and see where the program was when it died
- Not as good as watching it run…
- …but sometimes the best you can do
- Modern graphical debuggers typically show:
- The source code
- The call stack
- The values of variables that are currently in scope
- I.e., global variables, parameters to the current function call, and local variables in that function
- A panel displaying what your program has printed to standard output and/or standard error
![[A Debugger in Action]](img/debugging/debugger_in_action.png)
Figure 11.1: A Debugger in Action |
- If the debugger is part of a larger integrated development environment (IDE), it may show other things as well
- A source browser that presents an outline of the project's modules, classes, functions, variables, etc.
![[Source Browser]](img/debugging/source_browser.png)
Figure 11.2: Source Browser |
- A code assistant that presents context-sensitive help and documentation
![[Code Assistant]](img/debugging/code_assistant.png)
Figure 11.3: Code Assistant |
- Note: tools like this are available for every modern language
![[Microsoft Visual Studio in Action]](img/debugging/msvstudio.png)
Figure 11.4: Microsoft Visual Studio in Action |
- Comment on this slide
Basic Operations
- The debugger can set breakpoints in the target program
- Tells the target to pause when it reaches particular lines of code
- Once the program is paused, the debugger can single-step it
- Execute one statement at a time
- Most can:
- When the target is paused, the debugger can display the contents of its memory
- The compiler (or interpreter) keeps track of the types of values in memory…
- …so the debugger can show integers as integers, rather than as sequences of bytes
![[Inspecting Values]](img/debugging/inspecting_values.png)
Figure 11.5: Inspecting Values |
- Most debuggers can also evaluate expressions using the current values of variables
- E.g., type in
2*x<0, debugger displays False - Yes, this means that every debugger contains a mini-interpreter for the language in question…
- This is a lot easier than adding print statements…
- Comment on this slide
How Debuggers Work
- A running program is:
- A bunch of bytes in memory whose values are instructions
- Each statement in the source program typically corresponds to several instructions
- The compiler (or interpreter) can produce a table that records this mapping
- Another bunch of bytes whose values are the program's variables
- The static space contains constant strings, magic numbers, etc.
- The call stack is the parameters to all currently-active function calls
- The heap is all dynamically-created objects
- Some registers
- An instruction pointer that keeps track of what to execute next
- A stack pointer to keep track of the call stack
- Miscellaneous other registers for doing arithmetic, etc.
- In order to set a breakpoint on a particular line, a debugger:
- Figures out which instructions were produced for the statement on that line
- Records the first of those instructions in a safe place
- Replaces that instruction with a
HALT instruction- Every processor and virtual machine has one of these
![[Creating a Breakpoint]](img/debugging/setting_breakpoint.png)
Figure 11.6: Creating a Breakpoint |
- When the target program reaches the
HALT instruction:
- It stops, and sends a signal to the operating system…
- …which passes the signal on to the debugger…
- …which waits for instructions from the user
- Such as “Show me the value of
maxTemp”
- When the user wants the program to resume, the debugger:
- Puts the instruction at the breakpoint location back in place
- Tells the program to execute just that instruction
- Every processor and virtual machine has a “single step” mechanism
- Replaces the instruction with a
HALT once again- Otherwise, the breakpoint will disappear
- Tells the program to run normally
- Comment on this slide
Advanced Operations
- Debugger lets you move up and down the call stack
- Remember, the set of outstanding function calls is just a bunch of variables in memory
- Debugger can ask how large each one is, so it knows where to go to get to the previous stack frame
- Lets you figure out how execution got to a particular place
- Otherwise known as, “What?? We shouldn't be here!!”
- Can modify variables in a running program, too
- If you have a theory about why an intermittent bug is occurring:
- Run to that point
- Set variables' values (e.g., set
maxTemp to -1) - Resume execution
- Sometimes used to test error handling code
- Easier to change
timeSpentWaiting to 600 seconds in debugger than to pull out the network cable and wait…
- Can set conditional breakpoints
- Only pauses program if some condition is met
- Loop index greater than 100
- Filename argument to function is
None
- Some debuggers also support watchpoints
- Have the debugger watch every write to memory
- Halt when anything, anywhere, modifies a particular variable
- Slow the program down a lot
- Typically a factor of 100 or more
- But sometimes the only practical way to find out why a string in C is being overwritten
- Comment on this slide
Rule 0: Get It Right the First Time
- The simplest bugs to fix are the ones that don't exist
- Design, reflect, discuss, then code
- A week of hard work can sometimes save you an hour of thought
- Make good style a habit
- Train yourself to do things right, so that you'll code well even when you're tired, stressed, and facing a deadline
- Comment on this slide
Rule 1: What Is It Supposed to Do?
- First step is knowing what the problem is
- “It doesn't work” isn't good enough
- What exactly is going wrong?
- How do you know?
- Requires you to know how the software is supposed to behave
- Is this case explicit in the specification?
- If not:
- Do you have enough knowledge to extrapolate?
- Do you have the right to do so?
- Write it down!
- Preferably away from the keyboard and screen
- Try not to let what you're seeing influence what you expect to see
- Cognitive dissonance leads us to adjust what we think we're observing in order to fit preconceptions
- Question: What do testers do while waiting for code?
- Answer: read the specification and start writing down test cases
- If there's no specification, you're wasting valuable time
- And if QA isn't involved in drawing up the specification, to guarantee testability, you're wasting even more
- Comment on this slide
Rule 2: Is It Plugged In?
- Are you actually exercising the problem that you think you are?
- Are you giving it the right test data?
- Is it configured the way you think it is?
- Is it the version you think it is?
- Has the feature actually been implemented yet?
- Why are you sure?
- Maybe the reason you can't isolate the problem is that it's not there
- Another argument in favor of automatic regression tests
- Guaranteed to rerun the test the same way each time
- Also a good argument against automatic regression tests
- If the test is wrong, it will generate the same false negative each time
- Comment on this slide
Rule 3: Make It Fail
- You can only debug things when you can see them going wrong
- Find a test case that makes the code fail
- Then try to simplify it
- Then try to find another one
- Use data to weed out false hypotheses
- Remember, it's computer science, not computer flip-a-coin
- If you can't make it fail repeatedly, you're going to have to rely on:
- Post-mortem inspection (autopsies)
- But then you have to reason backwards to figure out why the program crashed
- Logging with print statements
- But this can distort the program's behavior
- Comment on this slide
Rule 4: Divide and Conquer
- Once you have a test that makes the system fail, isolate the faulty subsystem
- Faults can only occur upstream from where the system fails
- Look at input to the module that's failing
- If that's wrong, look at preceding module's input, and so on
- Fail early, fail often
- Use
assert to trap consistency errors early - Every time you fix a bug, see whether you can add an assertion
- If you made the mistake once, odds are that you, or someone, will make it again
- Another strong argument against duplicated code
- Comment on this slide
Rule 5: Change One Thing at a Time, For a Reason
- Replacing random chunks of code unlikely to do much good
- If you got it wrong the first time, what makes you think you'll get it right the second? Or the ninth?
- So always have a hypothesis before making a change
- Change one thing, then re-run your tests
- Changes can often uncover (or introduce) new bugs
- The more things you change at once, the harder it is to know what's responsible for what
- And the harder it is to keep track of what you've done, and what effect it had
- Comment on this slide
Rule 6: Write It Down
- Science works because scientists keep records
- If you try to keep the state of play in your head, you will lose track (and time)
- “Wait, didn't A followed by B with an odd number of lines cause those symptoms? Or was it B followed by A? Or was it A and B with an even number of lines?”
- Records particularly useful when getting help
- People are more likely to listen when you can explain clearly what you did
- Comment on this slide
Rule 7: Be Humble
- If you can't find it in 15 minutes, ask for help
- Just explaining the problem aloud is often enough
- Don't keep telling yourself why it should work: if it doesn't, it doesn't
- Never debug while grinding your teeth, either…
- Keep track of your mistakes
- Just as runners keep track of their time for the 100 meter sprint
- You're most likely to fix what you're paying attention to
- Comment on this slide
Summary
- How productive you are depends on:
- What tools you use
- How well you use them
- A symbolic debugger will make you more productive
- Like any other tool, there's a learning curve
- But after a few sessions, you will have more working code in less time
- The best debugger in the world is no substitute for working methodically
- The scientific method is the best tool we've ever invented for finding out “why”
- Comment on this slide
Object-Oriented Programming
Motivation
- How should a programmer represent XY points?
- Using a list of two numbers is convenient, but:
- People reading the code have to remember (or guess) that
p[1] means “the Y coordinate” - There's nothing to stop someone accidentally appending a third “coordinate”, or setting the X coordinate to
"orange"
- (Almost) all modern languages encourage programmers to define abstract data types
- Called abstract because they hide the details of their implementation
- Programmers interact with them through a limited set of operations, rather than by manipulating data directly
- Fewer things can go wrong
- Easier to read resulting code
- Makes code easier to maintain, since internals can be changed without changing calling code
- An ADT is usually created by defining a class, then creating one or more objects that are instances of that class
- All objects have the same methods (abilities), but their own state
- Changes to one object do not affect the state of others
- Comment on this slide
A Naked Class
- Define a class in Python using the
class keyword- Like other keywords,
class is followed by ":", and introduces block - Name of the new class is usually followed by
object in parentheses- We'll see other options later
- Create a new instance of the class by “calling” the class name as if it were a function
- If the class's body is
pass, this creates a blank object - See in a moment how to make objects more interesting
- Most objects have data members that store their internal state
- A “personal” variable, like “my nose” or “my watch”
- Use the expression
obj.data to access the data member of obj
- Unlike C++ and Java, Python has no notion of “hiding” data members
- Create data members simply by referring to them
- The expression
obj.x = 3 either:
- Gives
obj a new member x, and assigns it the value 3, or - Overwrites the existing member
x with the value 3
- Just like creating variables in a function
- Example: an XY point
class point2d(object):
pass
if __name__ == '__main__':
p = point2d()
p.x = 0.0
p.y = 3.0
print "point is", p
print "coordiantes are", p.x, p.y
- Comment on this slide
Methods
- Naked classes are just glorified dictionaries
- Instead of
obj['x'], have obj.x
- Really become interesting when we start to define methods
- Any function defined inside a class belongs to that class
- When a method is called, Python passes the object itself as the method's first argument
- Which means that every method must take at least one argument
- By convention, it is called
self
- Why use methods?
- Make code easier to understand by grouping operations
- Give programmers a place to check the integrity of their ADTs
- The expression
obj.meth(arg) is equivalent to:
- Find the class that
obj is an instance of - Call that class's
meth method with obj and arg as arguments
- When a new object is created, Python checks whether the class has defined a method called
__init__
- If it has, Python calls it before doing anything else
- This constructor can give the object its initial data members simply by assigning them values
- Inside a method, refer to the object's
data member as self.data
- Since the object was passed to the method as the parameter
self - The expression
data is not the same as self.data
data on its own means “a local variable called data”
- Example: a 2D point in the upper right quadrant
class point2d(object):
# Constructor defines how to create new objects.
def __init__(self):
self.x = 0
self.y = 0
# Get X coordinate.
def getX(self):
return self.x
# Set X coordinate.
def setX(self, newX):
assert newX >= 0
self.x = newX
# Get and set Y coordinate.
def getY(self):
return self.y
def setY(self, newY):
assert newY >= 0
self.y = newY
# Calculate norm.
def norm(self):
return math.sqrt(self.x ** 2 + self.y ** 2)
- The
__init__ constructor sets the point's initial coordinates - The getter methods replace raw read access
- The setter methods check the constraints on the state of the class
- If someone tries to assign a negative coordinate, the class will raise an exception
- Fail early, fail often
- Note: when testing the class, test the error handling too
# Tests.
if __name__ == '__main__':
p = point2d()
assert p.getX() == 0.0
assert p.getY() == 0.0
p.setX(3.0)
p.setY(4.0)
assert p.getX() == 3.0
assert p.getY() == 4.0
assert p.norm() == 5.0
try:
p.setX(-1.0)
assert False
except AssertionError:
pass
- Keep the memory model in mind when thinking about objects
![[Memory Model]](img/oop/memory_model.png)
Figure 12.1: Memory Model |
- A method is just a function that Python can call in a particular way
- You can call it directly if you want to
p = point2d()
p.setX(9.0)
p.setY(2.0)
print "calling getX directly gives", point2d.getX(p)
calling getX directly gives 9.0
- Comment on this slide
Defining a Queue
- A queue can do three things:
- Enqueue an item
- Dequeue the front item
- Which raises an exception if the queue is empty
- Remember, error behavior is part of a class's interface too
- Report whether it is empty or not
- Which is less powerful than reporting its size
- A queue may store items in many different ways
- The simplest in Python is probably to use a list…
- …but it isn't the only one…
- …and whoever is using the class shouldn't care
- At least, not unless speed and/or memory requirements become an issue
- Implementation:
class Queue(object):
'''Implement a queue.'''
def __init__(self):
'''Construct an empty queue.'''
self.data = []
def enq(self, val):
'''Add a new value to the end of the queue.'''
self.data.append(val)
def deq(self):
'''Return the front value of the queue, or raise ValueError if
the queue is empty.'''
if len(self.data) == 0:
raise ValueError, "Can't dequeue from empty queue"
val = self.data[0]
self.data = self.data[1:]
return val
def empty(self):
'''Report whether the queue is empty or not.'''
return len(self.data) == 0
- Use docstrings instead of comments to document classes and methods (and functions)
- Unlike comments, they are kept at runtime to provide interactive help
print Queue.__doc__
print Queue.enq.__doc__
print q.deq.__doc__
Implement a queue.
Add a new value to the end of the queue.
Return the front value of the queue, or raise ValueError if
the queue is empty.
raise ValueError instead of using assert when someone tries to dequeue from an empty queue- The operation isn't a flaw in this code (which is what assertions are for)
- Question: should
point2d.setX and point2d.setY have raised ValueError as well?
- Always include tests
if __name__ == '__main__':
q = Queue()
assert q.empty()
try:
x = q.deq()
assert False
except ValueError:
pass
q.enq('first')
assert not q.empty()
q.enq('second')
assert q.deq() == 'first'
assert q.deq() == 'second'
assert q.empty()
- We'll see in the next two lectures how to do this systematically
- Comment on this slide
Special Methods
__init__ is just one example of a special method
- All have names beginning and ending with double underscore
- Give programmers a way to make their data types look like those built into Python
- Most widely used is probably
__str__
- When Python needs a string representation of an object, it sees if the object's class defines
__str__ - If it does, Python calls it, and uses the string it returns
- If it doesn't, Python creates a default string representation
class first(object):
def __init__(self, name):
self.name = name
class second(object):
def __init__(self, name):
self.name = name
def __str__(self):
return 'second/"%s"' % self.name
f = first("orange")
print "first object is", f
s = second("apple")
print "second object is", s
first object is <__main__.first object at 0x401d982c>
second object is second/"apple"
- Other examples:
- If
obj is an object, len(obj) calls obj.__len__()
- Easy to add to
Queue to report how long the queue is
x in obj is equivalent to obj.__contains__(x)
- Would also be easy to add to
Queue, but should we? - Don't want to create “kitchen sink” classes
class Queue(object):
def __init__(self):
self.data = []
def enq(self, val):
self.data.append(val)
def deq(self):
if len(self.data) == 0:
raise ValueError, "Can't dequeue from empty queue"
val = self.data[0]
self.data = self.data[1:]
return val
def __len__(self):
return len(self.data)
def empty(self):
return self.__len__() == 0
if __name__ == '__main__':
q = Queue()
q.enq('first')
q.enq('second')
assert len(q) == 2
q.deq()
q.deq()
assert q.empty()
- Comment on this slide
Inheritance
- What if we want two types of queue, one with
__len__ and one without? - More realistically, suppose we have two types of people: professors and students
- They have a lot in common: names, email addresses, etc.
- But each has things the other doesn't (e.g., an office, or courses taught vs. courses taken)
- Really don't want to write each method twice
- Anything repeated in two or more places will eventually be wrong in at least one.
- The object-oriented solution is inheritance
- One class extends its parent class by adding new state and methods
- Anything instances of the parent class can do, instances of the child class can do as well
- But not vice versa
- Extend any class with any number of child classes
- All children share the behavior of their parent
- But are completely independent of one another
- For example, define a
Person as:
class Person(object):
'''Represent generic traits of any person.'''
def __init__(self, name, email):
'''Create a new person with a name and an email address.'''
self.name = name
self.email = email
def getName(self):
return self.name
def getEmail(self):
return self.email
- Then derive
Professor:
class Professor(Person):
'''Professors are more than people...'''
def __init__(self, name, email, office):
Person.__init__(self, name, email)
self.office = office
def getOffice(self):
return self.office
- Note that it's
Professor(Person) instead of Professor(object)
- Signals that
Professor extends Person instead of the generic built-in object class
- Note also that
Professor's constructor calls Person's- Ensures that all of
Person's initialization is done without duplicating code - Ensures that changes to
Person will automatically be reflected in Person
- Now define
Student:
class Student(Person):
'''Students are more than people too...'''
def __init__(self, name, email, program):
Person.__init__(self, name, email)
self.program = program
def getProgram(self):
return self.program
- And run a few tests:
if __name__ == '__main__':
tb = Professor('Tycho Brahe', 't.brahe@copenhagen.dk', 'Room 241')
print tb.getName(), tb.getEmail(), tb.getOffice()
jk = Student('Johannes Kepler', 'j.kepler@uraniborg.org', 'B.A.')
print jk.getName(), jk.getEmail(), jk.getProgram()
Tycho Brahe t.brahe@copenhagen.dk Room 241
Johannes Kepler j.kepler@uraniborg.org B.A.
- Comment on this slide
Polymorphism
- Suppose we define
Person.__str__, then override it with more specific definitions in Professor and Student
- When Python looks for a method, it checks the object's class, then its parent class, and so on
![[Method Lookup]](img/oop/polymorphism.png)
Figure 12.2: Method Lookup |
- Result is that Python calls the most appropriate method for an object
class Person(object):
def __init__(self, name, email):
self.name = name
self.email = email
...
def __str__(self):
return '%s (%s)' % (self.name, self.email)
class Professor(Person):
def __init__(self, name, email, office):
Person.__init__(self, name, email)
self.office = office
...
def __str__(self):
result = Person.__str__(self)
result += ' Room %s' % self.office
return result
class Student(Person):
def __init__(self, name, email, program):
Person.__init__(self, name, email)
self.program = program
...
def __str__(self):
result = Person.__str__(self)
result += ' Program %s' % self.program
return result
if __name__ == '__main__':
crowd = [
Professor('Tycho Brahe', 't.brahe@copenhagen.dk', 'Room 241'),
Student('Johannes Kepler', 'j.kepler@uraniborg.org', 'B.A.')
]
for person in crowd:
print person
Tycho Brahe (t.brahe@copenhagen.dk) Room Room 241
Johannes Kepler (j.kepler@uraniborg.org) Program B.A.
- Polymorphism means “having more than one form”
- In object-oriented programming, it means being able to treat different types of things as interchangeable
- E.g., if every class derived from
Shape defines a method area, then obj.area() will return the area of obj without the caller having to know if obj is actually a square or a circle
- Comment on this slide
The Substitution Principle
- Python looks up each method when it's called
- So any two classes that define the same set of methods can be used interchangeably
- However, the usual way to implement polymorphism is via inheritance
- Only override things that need to change
- The Liskov Substitution Principle states that it should always be possible to replace an instance of a parent class with an instance of the child class
- Means that if
Parent.meth returns a string, Child.meth must also return a string - If
Parent.meth returns a non-empty string with at least three vowels, Child.meth must as well
- Often expressed in terms of pre-conditions and post-conditions
Child.meth must satisfy all the post-conditions of Parent.meth, and may impose more- In other words,
Child.meth's possible output is a subset of Parent.meth's - So any code that works correctly using the output of
Parent.meth will continue to work if given an instance of Child instead
Child.meth may ignore some of Parent.meth's pre-conditions, but may not impose more- Equivalently,
Child.meth accepts everything thatParent.meth did, and possibly more - So any code that could call
Parent.meth correctly is guaranteed to call Child.meth correctly too
- Comment on this slide
Class Members
- Sometimes want to share data between all instances of a class
- Constants, a count of the number of class instances created, etc.
- Any data members defined inside the
class block belong to the class as a wholeclass Counter(object):
num = 0 # Number of Counter objects created.
def __init__(self, name):
Counter.num += 1
self.name = name
if __name__ == '__main__':
print 'initial count', Counter.num
first = Counter('first')
print 'after creating first object', Counter.num
second = Counter('second')
print 'after creating second object', Counter.num
initial count 0
after creating first object 1
after creating second object 2
- Can also define methods that belong to the class, rather than to particular objects
- Out of scope for this course, but often useful
- Comment on this slide
Overloading Operators
- (Almost) every feature of Python objects can be overridden by defining the right special method
- If done carefully, makes code much easier to read
- If done haphazardly, makes it almost impossible to understand
- One common use is operator overloading
- The symbol
"+" is just a shorthand for a function that takes two arguments - If a class defines a method
__add__, it takes precedence over the built-in “function”
- So
x + y calls x.__add__(y)
- Of course, it's not quite that simple
2 + x and x + 2 don't always do the same thing- String concatenation
- Matrix multiplication
- Classes can define right-hand versions of operators, e.g.,
__radd__ instead of __add__
- If
x has an __add__ method, call that - Otherwise, if
y has an __radd__ method, call that - Otherwise, try Python's built-ins
- Example: a vector is sparse if most of its entries are zero
- No point padding eleven actual values with nine million zeroes
- So use a dictionary to record non-zero values and their indices
- Addition: create a new vector with a non-zero value wherever either operand had a non-zero value
- Dot product: add up products of matching non-zero values
- Length: return one more than the index of the largest non-zero value
- “One more” to be consistent with Python's lists
- Hm…what is the length of
v after:
v = SparseVector() # all values initialized to 0.0
v[27] = 1.0 # length is now 28
v[43] = 1.0 # length is now 44
v[43] = 0.0 # is the length still 44, or 28?
- This isn't a programming question; it's a data modeling question
- What does the data in the program have to do to model the real world accurately?
- The easy stuff
- Construction creates an empty sparse vector
- Its length is the largest index ever seen
- May or may not be sensible, but it's easy to implement
class SparseVector(object):
'''Implement a sparse vector. If a value has not been set
explicitly, its value is zero.'''
def __init__(self):
'''Construct a sparse vector with all zero entries.'''
self.values = {}
def __len__(self):
'''The length of a vector length is the largest index ever seen.'''
if len(self.values):
return max(self.values.keys())
return 0
- Making it look like a vector
- Subscripting with
[] is just a binary operator tooobj[index] is implemented as obj.__getitem__(index)obj[index] = value is obj.__setitem(index, value)- Note: all programming languages do this—the only question is whether they let you see it
def __getitem__(self, key):
'''Return an explicit value, or 0.0 if none has been set.'''
if key in self.values:
return self.values[key]
return 0.0
def __setitem__(self, key, value):
'''Assign a new value to a vector entry.'''
self.values[key] = value
- Dot product
- The other object (on the right side of
"*") is usually called other - No reason to insist that it be a sparse vector
- Could equally well be a list of values
- So loop over the indices of our own values, multiply by corresponding values in other object
- Any index not encountered in this loop doesn't matter, since it corresponds to something that's zero
def __mul__(self, other):
'''Calculate cross product of a sparse vector with something else.'''
result = 0.0
for k in self.values.keys():
result += self.values[k] * other[k]
return result
def __rmul__(self, other):
return self.__mul__(other)
- Note:
__rmul__ = __mul__ makes __rmul__ point to the same code as __mul__
- There's no magic here: methods are functions, are objects, are things you can assign to names
- Addition
- Trickier than multiplication: result is non-zero wherever either argument is non-zero
- Don't want to loop over all the zeroes of either argument
- Solution: if the other object is a sparse vector, cheat
- I.e., reach inside it, and rely on details of its implementation
def __add__(self, other):
'''Add something to a sparse vector.'''
# Initialize result with all non-zero values from this vector.
result = SparseVector()
result.values.update(self.values)
# If the other object is also a sparse vector, add non-zero values.
if isinstance(other, SparseVector):
for k in other.values.keys():
result[k] = result[k] + other[k]
# Otherwise, use brute force.
else:
for i in range(len(other)):
result[i] = result[i] + other[i]
return result
# Right-hand add does the same thing as left-hand add.
__radd__ = __add__
- And finally, some tests
if __name__ == '__main__':
x = SparseVector()
x[1] = 1.0
x[3] = 3.0
x[5] = 5.0
print 'len(x)', len(x)
for i in range(len(x)):
print '...', i, x[i]
y = SparseVector()
y[1] = 10.0
y[2] = 20.0
y[3] = 30.0
print 'x + y', x + y
print 'y + x', y + x
print 'x * y', x * y
print 'y * x', y * x
z = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print 'x + z', x + z
print 'x * z', x * z
print 'z + x', z + x
- Comment on this slide
Structured Unit Testing
A Unit Testing Framework
- Unit testing follows a pattern
- Setup and teardown
- Lots of small, independent tests
- Reporting
- Aggregation (combine tests into a test suite, and test suites into larger suites)
- See a pattern, build a framework
- Write shared code once
- Encourage people to work a certain way
- I.e., make it easy for them to do things right
- JUnit is a testing framework originally written by Kent Beck and Erich Gamma in 1997
- Widely adopted
- Now integrated into almost all Java IDEs
- Made testing easy enough that programmers actually started doing it
- Widely imitated:
- Workalikes are available C++, Perl, .NET, etc.
- Once you know one, you can easily learn and use the others
- Widely extended
- Add-ons for measuring test execution times, recording tests, testing web applications, etc.
- Comment on this slide
Mechanics
- Python's standard unit testing framework is the
unittest module - What you do:
import unittest- Derive a class (or several classes) from
unittest.TestCase - Define one method for each test
- Name must begin with
"test" - Takes no argument (other than
self) - Returns nothing
- Call
unittest.main()
- What the framework does:
- Search the program to find classes derived from
TestCase
- See how this works in a later lecture
- Search each class to find methods whose names begin with the word
"test" - For each method:
- Create a new instance of the class
- Call the method
- Record pass, fail, or error
- Actually check things inside test methods using methods provided by
TestCase
- Allows the framework to distinguish between test assertions, and normal
assert statements- Since the code being tested might use the latter
- Checking methods include:
assert_(condition): check that something is true (note the underscore)assertEqual(a, b): check that two things are equalassertNotEqual(a, b): the reverse of the aboveassertRaises(exception, func, …args…): call func with arguments (if provided), and check that it raises the right exceptionfail(): signal an unconditional failure
- Comment on this slide
Testing a Function
- You want to test a function that calculates a running sum of the values in the list
- Replaces
[a, b, c, …] with [a, a+b, a+b+c, …] - Test cases:
- Empty list
- Single value
- Long list with mix of positive and negative values
- Hm… Is it supposed to:
- Return a new list?
- Modify its argument in place and return that?
- Modify its argument and return
None?
- Your tests can only ever be as good as (your understanding of) the spec
- We'll assume it's supposed to return a new list
- First implementation of
runningSum:
import sys, unittest
def runningSum(seq):
result = seq[0:1]
for i in range(2, len(seq)):
result.append(result[i-1] + seq[i])
return result
class SumTests(unittest.TestCase):
def testEmpty(self):
self.assertEqual(runningSum([]), [])
def testSingle(self):
self.assertEqual(runningSum([3]), [3])
def testDouble(self):
self.assertEqual(runningSum([2, 9]), [2, 11])
def testLong(self):
self.assertEqual(runningSum([-3, 0, 3, -2, 5]), [-3, -3, 0, -2, 3])
if __name__ == '__main__':
unittest.main()
F.E.
======================================================================
ERROR: testLong (__main__.SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "running_sum_wrong.py", line 21, in testLong
self.assertEqual(runningSum([-3, 0, 3, -2, 5]), [-3, -3, 0, -2, 3])
File "running_sum_wrong.py", line 6, in runningSum
result.append(result[i-1] + seq[i])
IndexError: list index out of range
======================================================================
FAIL: testDouble (__main__.SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "running_sum_wrong.py", line 18, in testDouble
self.assertEqual(runningSum([2, 9]), [2, 11])
File "c:\Python23\lib\unittest.py", line 302, in failUnlessEqual
raise self.failureException, \
AssertionError: [2] != [2, 11]
----------------------------------------------------------------------
Ran 4 tests in 0.000s
FAILED (failures=1, errors=1)
- Notice that test methods deserve to have meaningful names too
- Can you spot the error?
- Default output gives the stack trace, to help you find the problem
- You can also provide messages to the
TestCase.assert family of methods
- Tests are run in an arbitrary order
- Classes store methods in a dictionary for fast lookup
- Means that there's no guarantee they'll be run in the order in which you defined them
- Yet another good reason to make each test independent
- Fix the method and re-run the tests
def runningSum(seq):
result = seq[0:1]
for i in range(1, len(seq)):
result.append(result[i-1] + seq[i])
return result
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s
OK
- Everything works
- Should you really go to this much effort to test a simple function?
- Took less than a minute to write the four tests
- Uncovered one gap in the requirements, and one error in the first implementation
- Able to verify the fix almost instantly
- Sounds pretty good to me…
- Comment on this slide
Eliminating Redundancy
- Setting up a fixture can often be more work than writing the test
- Complex data structures, external files, etc.
- If the test class defines a method called
setUp, the framework will call it before running each test- And if there's a method called
tearDown, it will be run after each test
- Example: test whether a function can trace Make-like dependencies correctly
- Several kinds of tests to run, and several fixtures to run them on
- So create all the fixtures in
setUp, and use them in the test methods- Alternatively, create a separate test class for each fixture
- Worth doing if creating a fixture takes so long that you don't want any redundant setup
class TestDepends(unittest.TestCase):
def setUp(self):
self.f0 = {}
self.f1 = {'a' : []}
self.f2 = {'a' : ['a']}
self.f3 = {'a' : ['b'], 'b' : []}
self.f4 = {'a' : ['b', 'c'], 'b' : ['a'], 'c' : ['d'], 'd' : []}
def testCanReachSelf(self):
self.assert_(dependsOn(self.f1, 'a', 'a'))
self.assert_(dependsOn(self.f2, 'a', 'a'))
self.assert_(dependsOn(self.f3, 'b', 'b'))
self.assert_(dependsOn(self.f4, 'c', 'c'))
def testSingleStep(self):
self.assert_(dependsOn(self.f3, 'a', 'b'))
self.assert_(dependsOn(self.f4, 'a', 'b'))
self.assert_(dependsOn(self.f4, 'c', 'd'))
def testBackward(self):
self.assert_(dependsOn(self.f4, 'b', 'a'))
def testMultiStep(self):
self.assert_(dependsOn(self.f4, 'a', 'd'))
self.assert_(dependsOn(self.f4, 'b', 'd'))
def testNegative(self):
self.assert_(not dependsOn(self.f1, 'a', 'c'))
self.assert_(not dependsOn(self.f3, 'b', 'a'))
- Comment on this slide
Testing for Failure
- Testing that code fails in the right way is just as important as testing that it does the right thing
- Otherwise, someone will do something wrong some day, and the code won't report it
- Two ways to do it:
- Use
TestCase.assertRaises to check that a specific function raises a specific exception - Use
try/except yourself- Have to do it this way in C++, Java, and most other sturdy languages
- Example: testing a function that finds all values inside a double-ended range
- Raises
ValueError if the range is empty, or if the set of values is empty class TestInRange(unittest.TestCase):
def testNoValues(self):
try:
inRange([], 0.0, 1.0)
except ValueError:
pass
else:
self.fail()
def testBadRange(self):
try:
inRange([0.0], 3.0, 3.0)
inRange([0.0], 4.0, -2.0)
except ValueError:
pass
else:
self.fail()
- Comment on this slide
Testing I/O
- Input and output often seem hard to test
- Put a bunch of fixture files and expected results in a testing directory?
- Create temporary files when and as tests execute?
- The best answer is to use fake I/O using strings
- Python's
StringIO and cStringIO modules can read and write strings instead of files - Similar packages exist for C++, Java, and other languages
- This only works if the function being tested takes streams as arguments, rather than filenames
- If the function opens and closes the file, no way for you to substitute a fake file
- An example of designing code to make it testable
- Example: find lines where two files differ
- Nothing as fancy as
diff, just a line-by-line comparison - Input: two streams (which might be open files or
StringIO wrappers around strings) - Output: another stream (i.e., a file, or a
StringIO) - As a side effect, we've made the function itself more useful
- People can now use it to compare strings, or strings and files
class TestDiff(unittest.TestCase):
def wrapAndRun(self, left, right, expected):
left = StringIO(left)
right = StringIO(right)
actual = StringIO()
diff(left, right, actual)
self.assertEqual(actual.getvalue(), expected)
def testEmpty(self):
self.wrapAndRun('', '', '')
def testLengthyMatch(self):
str = 'a\nb\nc\n'
self.wrapAndRun(str, str, '')
def testSingleLineMismatch(self):
self.wrapAndRun('a\n', 'b\n', '1\n')
def testMiddleMismatch(self):
self.wrapAndRun('a\nb\nc\n', 'a\nx\nc\n', '2\n')
- Comment on this slide
Testing With Classes
- Testing classes isn't much different from what we've already seen
- Create objects, call their methods, check the results
- But how can we keep tests independent?
- Can't test methods unless we can construct objects…
- …but the only way to see if objects have been constructed properly is to invoke their methods
- The answer is to keep tests as independent as possible
- The first tests you write use simple constructors, and simple inspector methods
- Once those pass, start writing more complex tests
- They probably won't be executed in this order once they're all written…
- …but they still give you somewhere to set a breakpoint when simple things start to go wrong
- Much more interesting to think about how to use object-oriented ideas in constructing tests
- Create one class that has a lot of tests
- Derive other test classes that:
- Create specific fixtures
- Run specific tests
- By default, the framework will run all the generic tests for the specific classes as well
![[Object-Oriented Testing]](img/unit/oo_tests.png)
Figure 13.1: Object-Oriented Testing |
- This only works if the tests in the base class are generalized
- A little more setup work for a big payoff
- Example: billing application for a software consulting firm
- Each consultant may be assigned to zero or more projects
- Each project may have zero or more consultant currently assigned to it
- The
Assignment class keep track of who is currently assigned to what- Use strings for consultant and project IDs for now
- Create a base class
CommonTests to hold shared tests- Do not derive this from
unittest.TestCase - If we did, the framework would automatically try to run it…
- …and it's not designed to be run on its own
- This class does not create a fixture
- That's the job of the derived class
- Add a test
- Each test records the state of the fixture…
- …does something…
- …and then checks that the fixture is in the right final state
- Remember, the fixture is recreated from scratch for each test
class CommonTests(object):
'''Tests that can be applied to all fixtures.'''
def testAddNew(self):
'''Check that adding a single new assignment works.'''
priorConsultants = self.fixture.getConsultants()
priorProjects = self.fixture.getProjects()
self.fixture.addAssignment('Alan', 'tables')
self.assertEqual(self.fixture.getByConsultant('Alan'),
Set(['tables']))
self.assertEqual(self.fixture.getByProject('tables'),
Set(['Alan']))
self.assertEqual(self.fixture.getConsultants(),
Set(['Alan']) | priorConsultants)
self.assertEqual(self.fixture.getProjects(),
Set(['tables']) | priorProjects)
- Derive another class from both
CommonTests and unittest.TestCase
- The former ensures that it inherits all the common test cases
- The latter ensures that the framework will find it, and know how to run it
- This class creates a fixture, and adds a few fixture-specific tests
class TestEmpty(unittest.TestCase, CommonTests):
'''Test an empty assignment table.'''
def setUp(self):
'''Create the empty assignment table.'''
self.fixture = Assignment()
def testNoConsultants(self):
'''Make sure that nonexistent consultants aren't present.'''
self.assertEqual(self.fixture.getByConsultant('nobody'), Set())
def testNoProjects(self):
'''Make sure that nonexistent projects aren't present.'''
self.assertEqual(self.fixture.getByProject('nothing'), Set())
- Run it
- Framework finds
TestEmpty - Goes through the test cycle (test, setup, run, teardown) for each
testXYZ method in it, and in CommonTests
- Now add another shared test to
CommonTests:
def testAddDelNew(self):
'''Check that adding and then removing an assignment
leaves things as they were.'''
priorConsultants = self.fixture.getConsultants()
priorProjects = self.fixture.getProjects()
self.fixture.addAssignment('Alan', 'tables')
self.fixture.delAssignment('Alan', 'tables')
self.assertEqual(self.fixture.getConsultants(), priorConsultants)
self.assertEqual(self.fixture.getProjects(), priorProjects)
- And/or derive another test class:
class TestMulti(unittest.TestCase, CommonTests):
'''Set up common tests with multiple assignments already present.'''
def setUp(self):
self.fixture = Assignment()
self.fixture.addAssignment('Bhargan', 'dishes')
self.fixture.addAssignment('Harald', 'dishes')
self.fixture.addAssignment('Harald', 'floors')
self.fixture.addAssignment('Rachel', 'floors')
self.fixture.addAssignment('Rachel', 'counters')
self.fixture.addAssignment('Sally', 'counters')
- Multiplicative effect: (common tests × derived classes) + specific tests
- Comment on this slide
Test-Driven Development
- With test-driven development (TDD), you write your tests, then write the code
- A key element of agile development methodologies like eXtreme Programming…
- …but it was around long before they were…
- …and pays off in almost every context
- Seems backward, but has very real advantages:
- Gives programmers a definite goal
- Coding is finished when all tests run
- Particularly useful when trying to fix bugs in old code, as it forces you to figure out how to recreate the bug
- Helps prevent the “one more feature” syndrome
- Ensures that tests actually get written
- People are often too tired, or too rushed, to test after coding
- Helps clarify the API before it is set in stone
- If something is awkward to test, it can be redesigned before it's written
- A great way to communicate requirements
- I write the tests
- “All” you have to do is write code that passes those tests
- Example: I want you to write a function that:
- A single rectangle (represented as an instance of class
Rect) as its first argument - And a list of rectangles (each also an instance of
Rect) as its second argument - And returns
True if the rectangles in the list completely cover the first rectangle, or False if they don't - Best way for me to specify everything I want is to write some test cases
class TestOverlay(unittest.TestCase):
def testNoRects(self):
self.assert_(not overlay(Rect(0, 0, 1, 1), []))
def testWithSelf(self):
r = Rect(0, 0, 1, 1)
self.assert_(overlay(r, [r]))
def testHalfOnly(self):
self.assert_(not overlay(Rect(0, 0, 2, 2),
[Rect(0, 0, 1, 2)]))
def testTwoHalves(self):
self.assert_(overlay(Rect(0, 0, 2, 2),
[Rect(0, 0, 1, 2), Rect(1, 0, 2, 2)]))
- Comment on this slide
Exercises
Exercise 13.1:
Python has another unit testing module called doctest.
It searches files for sections of text that look like interactive
Python sessions, then re-executes those sections and checks the
results. A typical use is shown below.
def ave(values):
'''Calculate an average value, or 0.0 if 'values' is empty.
>>> ave([])
0.0
>>> ave([3])
3.0
>>> ave([15, -1.0])
7.0
'''
sum = 0.0
for v in values:
sum += v
return sum / float(max(1, len(values)))
if __name__ == '__main__':
import doctest
doctest.testmod()
Convert a handful of the tests you have written for other
questions in this lecture to use doctest. Do you prefer it
to unittest? Why or why not? Do you think doctest
makes it easier to test small problems? Large ones? Would it be
possible to write something similar for C, Java, Fortran, or
Mathematica?
Automated Builds
How Do You Rebuild A Program?
- Most programming languages require you to compile programs before running them
- E.g., turn C source files into machine-specific object files
- Which usually have a
.o extension
- It's easy enough to create
overlap.o from overlap.c: just run cc overlap.c - But what if you have many source files?
cc *.c is slow and inefficient if only one of your 117 source files has changed
- Worse, what if some files rely on others?
- Suppose the code in
overlap.c uses code in rectangle.c - If you change both, but only recompile
overlap.c, you'll be using out-of-date rectangles
- Comment on this slide
Automate, Automate, Automate
- Remember rule #2: anything worth repeating is worth automating
- Computers are good at details, and don't get bored, so have them do repetitive tasks
- We need:
- A way to describe the tasks (what things to do)
- A way to specify the dependencies between them (when to do things)
- Most widely used tool for this is called Make
- Invented in 1975 by Stuart Feldman at Bell Labs
[Feldman 1979]
- The good news is that Make will:
- Figure out what has changed
- Work out what is affected by those changes
- Execute commands to bring things up to date (e.g., by recompiling)
- Commands are handed to the shell for execution, so they're the same ones you would run yourself
- The bad news is Make's syntax
- Initially very simple, but has grown more complex over time
- Has turned into a little programming language in its own right (rule #11)
- We will ignore the complex bits for now, and take a look at a better way to solve these problems in Backward, Forward, and Sideways
- Comment on this slide
Our Example
- Make doesn't care what language you use
- In fact, it doesn't even care if you're using it to recompile things
- Often used to run tests, keep web sites up to date, etc.
- Any task where one action depends on others is a good candidate
- Running example: Sakura Synthesis is studying organic fullerene production
- Automated laboratory equipment runs experiments in batches to create files like this:
Time: 1.2271
Concentration: 0.0050
Yield: 11.41
Time: 2.5094
Concentration: 0.0055
Yield: 11.20
Time: 3.7440
Concentration: 0.0060
Yield: 10.90
- Each experiment produces 20-30 files
- May eventually have several thousand of them
- Want to:
- Generate summary files showing the results for particular trials in tabular form
- Update a file showing the correlation between concentrations and yields based on those tabular summaries
- Comment on this slide
Hello, Make
- Suppose we already have a program called
dat2csv.py that reformats data as a table - Put the following into a file called
hello.mk:
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
- Note: that indentation must be a tab character, not eight spaces, or a mix of spaces and tabs
- Yeah, it's a wart, but we're stuck with it
- Some terminology:
![[Structure of a Make Rule]](img/make/make_structure.png)
Figure 14.1: Structure of a Make Rule |
hydroxyl_422.csv is the target of the rulehydroxyl_422.dat is its prerequisite
- We'll see in a moment that a target may have many prerequisites
- The compilation command is the rule's action
- A rule may also have many actions
- Run
make -f hello.mk
- Make sees that the CSV file depends on the data file
- Since the CSV file doesn't exist, Make executes the rule's action (i.e., runs
dat2csv.py)
- Run
make -f hello.mk againhydroxyl_422.csv is newer than its prerequisite, so the action is not executed
- Comment on this slide
Multiple Targets
- Very few programs consist of a single file
- This Makefile describes two files that are independent of one another
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
methyl_422.csv : methyl_422.dat
python dat2csv.py methyl_422.dat > methyl_422.csv
- When you run
make -f double.mk, only hydroxyl_422.csv is compiled- The first target in the Makefile is the default target
- Unless you tell it otherwise, that's all Make will update
- Run
make -f double.mk methyl_422.csv to build methyl_422.csv - Comment on this slide
Phony Targets
- Obviously don't want to have to run Make separately for each possible target
- That would hardly count as “automation”
- Solution: create a phony target that:
- Depends on all the things you want to recompile, but
- Doesn't correspond to any files, and so is never up to date
- Making it therefore always executes its actions
all : hydroxyl_422.csv methyl_422.csv
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
methyl_422.csv : methyl_422.dat
python dat2csv.py methyl_422.dat > methyl_422.csv
- Now,
make -f phony.mk all recreates both .csv files
- Typical phony targets in a software project:
"all": recompile everything"clean": delete all temporary files, and everything produced by compilation"install": copy files to system directories- Many open source projects are installed by typing
make, and then make install
- Comment on this slide
Automatic Variables
- Make defines automatic variables to represent parts of rules
- These variables are automatically redefined for each rule
- Names are unfortunately very cryptic
- Keep a reference card handy
"$@" | The rule's target |
"$<" | The rule's first prerequisite |
"$?" | All of the rule's out-of-date prerequisites |
"$^" | All prerequisites |
Table 14.1: Automatic Variables in Make
- Example
all : hydroxyl_422.csv methyl_422.csv
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py $< > $@
methyl_422.csv : methyl_422.dat
python dat2csv.py $< > $@
- Note: by default, Make echoes actions before executing them
- Putting
"@" at the start of the action line prevents this
- Comment on this slide
Pattern Rules
- Most files of similar type (Java, C++, etc.) are compiled the same way
- Rule #2 again: write a pattern rule specifying what to do in the general case
- Use the wildcard
"%" to represent the stem of the file's name in the target and prerequisites- Must use automatic variables in the actions
- This is why they were invented
all : hydroxyl_422.csv methyl_422.csv
%.csv : %.dat
python dat2csv.py $< > $@
clean :
@rm -f *.csv
- Notice that there's another phony target in this Makefile
make -f rule.mk clean will get rid of all the generated files- The
-f flag stands for “force”
- Tells
rm not to report an error when trying to delete files that don't exist
- Comment on this slide
Dependencies
- Next step: create an overall summary for each treatment
- E.g., combine
hydroxyl_422.csv and hydroxyl_480.csv to create hydroxyl_all.csv, and similarly for the methyl files, using summarize.py
hydroxyl_all must be rewritten if it is older than either hydroxyl_422.csv or hydroxyl_480.csv
- Which must in turn be rewritten if they're newer than their corresponding
.dat files
- In this case, also make the results depend on
summarize.py and dat2csv.py
- If the programs change, there's a good chance the data file format has changed
- Or a bug has been found and fixed
- Updated Makefile, and its output when no CSV files exist, are:
all : hydroxyl_all.csv methyl_all.csv
%_all.csv : %_422.csv %_480.csv summarize.py
python summarize.py $^ > $@
%.csv : %.dat dat2csv.py
python dat2csv.py $< > $@
clean :
@rm -f *.csv
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
python dat2csv.py hydroxyl_480.dat > hydroxyl_480.csv
python summarize.py hydroxyl_422.csv hydroxyl_480.csv > hydroxyl_all.csv
python dat2csv.py methyl_422.dat > methyl_422.csv
python dat2csv.py methyl_480.dat > methyl_480.csv
python summarize.py methyl_422.csv methyl_480.csv > methyl_all.csv
rm hydroxyl_480.csv methyl_422.csv hydroxyl_422.csv methyl_480.csv
- Notice that Make automatically removes intermediate files at the end
- Often visualize dependencies as a directed graph
- Dependencies can have dependencies, and so on
![[Visualizing Dependencies]](img/make/visualize_depend.png)
Figure 14.2: Visualizing Dependencies |
- Mark all the files that are out of date, and follow arrows to find out what will happen
- Note: Make can execute actions in any order it wants to, as long as it doesn't violate dependency ordering
- Notice that the rule
%_all.csv takes precedence over the rule %.csv
- Make uses the most specific rule available
- Comment on this slide
Defining Macros
- Often want to define values inside a Makefile
- E.g., the directory programs are to send their output to
- Or optimization flags for a compiler
- Do not want to set these options over and over again
- Rule #3: anything repeated in two or more places will eventually be wrong in at least one
- Solution: define variables (usually called macros)
- E.g., specify the directory where the lab machine puts data files:
INPUT_DIR = /lab/gamma2100
all : hydroxyl_all.csv methyl_all.csv
%_all.csv : %_422.csv %_480.csv
python summarize.py $^ > $@
%.csv : ${INPUT_DIR}/%.dat
python dat2csv.py $< > $@
clean :
@rm -f *.csv
- Define by specifying name and value
- To get value, put a
"$" in front of the name, and parentheses or braces around it- Without the parentheses, Make will interpret
"$XYZ" as the value of the variable "X", followed by the characters "YZ" - Yes, it's ugly
- Sometimes useful to pass values into Make when invoking it
- E.g., change the input directory
- Rather than rewriting the Makefile each time, you can specify
name=value pairs on the command line- Define a macro with the default value
- Override it when you want to
- So:
make -f macro.mk sets INPUT_DIR to /lab/gamma2100- But
make INPUT_DIR=/oldcrankymachine -f macro.mk uses /oldcrankymachine
- Make also looks at environment variables
- You can refer to
${HOME} in a Makefile without having defined it - Example:
VAL = original
echo :
@echo "VAL is" ${VAL}
$ make -f env.mk VAR=changed echo
- Comment on this slide
Analysis
- Pro
- Simple things are simple to do…
- …and not too difficult to read…
- …especially compared to the alternatives
- Con
- The syntax is unpleasant
- Complex things are difficult to read…
- …and even more difficult to debug
- Best you can do is use
echo to print things as Make executes
- Not really very portable
- Hands commands to the shell for execution
- But commands use different flags on different operating systems
- Do you use
del or rm to delete files?
- Alternatives
- Ant
- primarily for Java
- Less platform-dependent…
- …but just as hard to read and debug
- Integrated development environments
- Most are proprietary and platform-specific
- SCons
- Give users a library to manage dependencies and actions
- “Makefile” is actually a small program
- More powerful, and debuggable, but steeper learning curve
- Once builds are automated, the next step is to run them continuously
- Every time someone checks something into version control, rebuild the software (or site), and re-run tests
CruiseControl and Bitten will both handle the details
- Comment on this slide
Exercises
Exercise 14.1:
How can you stop Make from removing intermediate files
automatically when it finishes processing?
Exercise 14.2:
Make gets definitions from environment variables,
command-line parameters, and explicit definitions in Makefiles.
What order does it check these in?
Coding Style and Reading Code
Introduction
- Good programming style is very hard to teach
- All the “rules” are either banal or overly restrictive
- Saying, “Make methods short, but not too short” doesn't really help
- Neither does saying, “No method shall be longer than sixty lines”
- Matters aren't helped by the fact that the strength of programmers' opinions is inversely proportional to the amount of data they have
- But good style is tremendously important
- Well-written code is less likely to contain errors
- And is much easier to maintain
- Rather than teaching you how to write well, this lecture will focus on how to read code
- How to figure out how the code works
- How to find what you need to know before making a change
- Hope that you'll pick up ideas about good style along the way
- For a comprehensive guide to how open source Unix applications are (or should be) structured, try
[Spinellis 2003]
- Comment on this slide
Why Read Code?
- People doing creative work in almost every field routinely inspect, dissect, and critique what's come before
- Would you hire an architect who'd never looked at anyone else's buildings?
- Programming is a sad exception
- Most students only ever read one- or two-page fragments in textbooks
- Like reading sonnets, then being asked to write a novel
- Open source has made millions of lines of code available for inspection
- Some good, some bad, most average
- Knowing how to read code is as useful as knowing how to read a proof
- Learn by example how to accomplish specific things
- Figure out how to make changes in a particular program
- Figure out how to use the program
- Code inspections are essential for achieving and maintaining quality
- Comment on this slide
Seven Plus or Minus
- Known since the 1950s that human short term memory can hold 7ŷ2 items
[Hock 2004]
- Seven random digits (as in phone numbers)
- Seven tasks that still have to be done
- Note: this is just an average
- Some prodigies can remember thousands of pieces of information for short periods of time
- If we try to remember more than that, we either:
- Make mistakes, or
- Create chunks so that we can remember things at a higher level
- Musicians don't remember individual chords; they remember IV-V-I progression
- Chess players remember “castled kingside”, rather than the positions of five pieces
![[Chunking in Short-Term Memory]](img/readstyle/castling.png)
Figure 15.1: Chunking in Short-Term Memory |
-
[Chase & Simon 1973]
studied what happened when novice and master chess players are shown actual and random positions
- Masters (naturally) remember actual positions better
- But they remember random positions worse
- They're “seeing” patterns that aren't there
![[Actual Chess Position]](img/readstyle/chess_actual.png)
Figure 15.2: Actual Chess Position |
| ![[Retention of Actual Chess Position]](img/readstyle/chess_actual_graph.png)
Figure 15.3: Retention of Actual Chess Position |
|
![[Random Chess Position]](img/readstyle/chess_random.png)
Figure 15.4: Random Chess Position |
| ![[Retention of Random Chess Position]](img/readstyle/chess_random_graph.png)
Figure 15.5: Retention of Random Chess Position |
|
- What does this have to do with programming?
- When you are reading and writing code, you have to keep a bunch of facts straight in your head for a short period of time
- What is this function supposed to return?
- What parameters does it have, and what do they mean?
- What does this loop's index refer to?
- Etc.
- The more odds and ends you require readers (including yourself) to keep track of, the greater their error rates will be
- Fundamental rule of good programming style is therefore to limit the amount of information needed at any one time
- Note that a hundred one-line methods is not an improvement over one hundred-line method
- A related rule: the greater the difference, the more likely we are to notice it
- So every difference in naming and layout ought to mean something…
- …because readers' brains will assume that it does
- Conversely, every significant difference ought to be visually distinct
- Comment on this slide
What Does This Function Do?
- Start with a simple question: how to figure out what this code does?
# Fill in citations.
def bibRef(root, bibOutput, bibInfo):
cites = root.getElementsByTagName('cite')
for cite in cites:
ref = str(cite.getAttribute('ref'))
if (not ref) or (ref not in bibInfo):
print "Cannot satisfy '%s'" % ref
assert False
cite.setAttribute('display', bibInfo[ref])
cite.setAttribute('href', '%s#%s' % (bibOutput, ref))
- First question: what's the context?
- In this case, it's part of a program to fill in citations
<cite ref="castro-xml"/> becomes <cite display="Castro 2002" href="bib.html#castro-xml" ref="castro-xml"/>
- Ways to answer the question:
- Guess based on function name
- Assume
bibRef stands for “bibliographic reference” - But it isn't a particularly informative name
- Read embedded documentation (such as comments)
- “Fill in citations.” is more helpful than
bibRef - Still doesn't tell us what
root, bibOutput, and bibInfo are - Or what side effects (if any) this function has
- Read the function body
- For each citation element in
body… - …get the value of the element's
ref attribute… - …make sure it's in
bibInfo (presumably a dictionary)… - …set the citation's
display and href attributes - Hm…that means
bibOutput must be the name of the output file - Notice how the explanation creates larger chunks from raws lines of code
- Consult external documentation
- In this case, there doesn't appear to be any
- Examine call(s) to the function
- First job is to find it (or them)
grep bibRef expandrefcite.py finds exactly one call # Substitute information.
for filename in args:
filename, _, _, root = getFile(filename)
genTime(root)
crossRef(root, xrefInfo)
bibRef(root, bibOutput, bibInfo)
outfile = open(filename, 'w')
outfile.write(root.toxml('utf-8'))
outfile.close()
- Looking higher up, see that
bibOutput is initialized to bib.html, so our guess was pretty good
- You use these techniques (consciously or otherwise) whenever you are reading code
- The purpose of good style is to improve the odds of them working
- A good way to think about size and scope is the notion of a program slice
- The subset of names in scope at a particular statement that are needed to understand what that statement does
- If the slice is much smaller than the method itself, the method is probably bloated
- Comment on this slide
Naming
- Names of files, classes, methods, variables, and other things are the most visible clue to purpose
- A variable called
temperature probably doesn't store a user name or the number of pottery shards found at a dig site
- Should therefore choose names that are both meaningful and readable
current_surface_temperature_of_probe is meaningful, but not readablecstp is easier to read, but hard to understand
- The smaller the scope of a name, the more compact it can be
- It's OK to use
i and j for indices in tightly-nested for loops - But not OK if the loop bodies are several pages long
- Of course, they shouldn't be anyway…
- The wider the scope of a name, the more descriptive it has to be
- Call a class
ExperimentalRecord, rather than ER or ExpRec
- Compare:
| Before | After |
|---|
import reader,splitter,transpose
import sys,os
a=[]
b=[]
c=[]
d=sys.argv[1]
a=reader.readLinesFromFile(d)
b=splitter.splitIntoSections(a)
c=d.split('.')
for i in range(len(b)):
if os.path.isfile('%s.%d.dat'%(c[0],i+1)):
print '%s.%d.dat already exists and cannot be overwritten!'%(c[0],i+1)
break
else:
output=file('%s.%d.dat'%(c[0],i+1),'w')
print>>output,transpose.transpose(b[i])
output.close()
| import sys, os
import reader, splitter, transpose
inputFileName = sys.argv[1]
lines = reader.readLinesFromFile(inputFileName)
sections = splitter.splitIntoSections(lines)
fileNameStem = inputFileName.split('.')[0]
for i in range(len(sections)):
outputFileName = '%s.%d.dat' % (fileNameStem, i+1)
if os.path.isfile(outputFileName):
print '%s already exists and cannot be overwritten!' % outputFileName
break
else:
output = file(outputFileName, 'w')
print >> output, transpose.transpose(sections[i])
output.close()
|
- Use naming conventions to indicate purpose
- Constants written
ALL_UPPER_CASE_WITH_UNDERSCORES - Class names written using
CamelCase with first letter capitalized- Name “camel case” comes from the humps in the middle
- Parameters, methods, local variables, etc., use
camelCase with first letter in lower case
- Many other conventions make just as much sense
- Write parameters, etc., using underscores, as in
current_maximum - Write class members like
this_, _this, or m_this
- Not as useful in Python, where you have to refer to members as
self.this anyway - But useful in C++, Java, and other languages
- Use an agreed list of abbrevations
currAveTemp instead of currentAverageTemperature- “Agreed” is important: mixing
currAvTemp and currAveTemp produces hard-to-spot bugs
- Most important thing is to be consistent
- Anything consistent is readable after a while
- Just watch kids learning to read French, Punjabi, and Korean
- But your brain automatically notices visual differences, and assumes they're important
- It actually does take longer to read code written in a mix of styles…
- …and the error rate is higher
- Conform to
PEP-008: Python Style Guide unless you have a really good reason not to - Comment on this slide
Idioms
- Every language (human or computer) has idioms
- Sayings or phrases whose meaning does not follow from the meaning of the words it is made up of
- E.g., “face the music” does not mean “look at the orchestra”
- A way to aggregate raw data into chunks
- As you read programs, look for uses of common idioms
- Example: C/C++ and Java allow assignment in the middle of an expression
z = (y = x) + 2 assigns the value of x to y, then adds 2 to that and assigns it to z
- No one would (or should) ever use it in the manner show above, but it makes certain kinds of loops more compact
| Without Idiom | Using Idiom |
|---|
line = infile.readLine();
while (line != null) {
...do something with line...
line = infile.readLine();
}
| while ((line = infile.readLine()) != null) {
...do something with line...
}
|
- Assign next line (if any) to
text, then check for end of input, all in the top of the loop
- This can be even more compact in C and C++, which treat null as equivalent to false
while (line = infile.readLine()) {
...do something with line...
}
- But Python doesn't allow assignment in the middle of expressions
- And in older versions, you couldn't iterate directly over a file using
for line in input: - Solution was an idiom that looked like an infinite loop, but had a
break statement in it to handle end-of-input while 1:
line = infile.readline()
if not line:
break
...do something with line...
- Note: using 1 instead of
True, since older Pythons didn't define True either
- You can figure out what the code is doing even if you don't know the idiom…
- …but knowing the idiom makes it a lot simpler for your brain to chunk what it's reading
- Style guideline: use your language's idioms
- If everyone else writes their code a certain way, write yours that way too
- I never liked Python's
while 1: break idiom, but I used it
- Two ways to learn idioms:
- Books on programming style (e.g., the “Effective” series from Addison-Wesley)
- Reading other people's code
- Note: idioms evolve over time
- Languages gain new features, programmers learn new tricks, etc.
- People who have grown up with Java tend to write cleaner C than people who learned it as a first language
- So when reading code, prefer the style of newer code over that of older stuff
- Comment on this slide
Style Tools
- A growing number of tools exist to check and enforce style guidelines
- Look for things that the language spec and compilers allow, but which make code harder to read
PyLint
- Parses programs to create an annotated syntax tree
- A data structure that represents classes, loops, variables, etc.
![[Annotated Syntax Tree]](img/readstyle/ast.png)
Figure 15.6: Annotated Syntax Tree |
- Then searches the tree to find things that match problem patterns
- Each pattern implemented as a class that looks for certain things in the tree
- Which means that users can write new checkers if they want to
class Metal:
name = "aluminum"
def __init__(self):
self.name = arg
def meth(self, val):
self.v = val
pylint badcode.py
************* Module badcode
W: 8: Bad indentation. Found 6 spaces, expected 8
W: 0: Missing docstring
W: 0: Missing required attribute "__revision__"
W: 1:Metal: Missing docstring
E: 5:Metal.__init__: Undefined variable 'arg'
W: 7:Metal.meth: Missing docstring
R: 1:Metal: Not enough public methods (1/2)
W: 8:Metal.meth: Attribute 'v' defined outside __init__
PyChecker
- Imports the module (or modules)
- If the code doesn't run,
PyChecker can't analyze it
- Uses compiled byte codes as well as source when doing analysis
pychecker badcode.py
Warnings...
badcode.py:5: No global (arg) found
- Similar tools exist for [name of your language goes here]
- Most check for much more than just indentation and naming conventions
- Can also find:
- Dead code (i.e., variables and methods that are never used)
- Unhandled exceptions
- Duplicate code
- And much more
- Use them during development
- Don't check anything into version control unless it passes a style check
- Take their warnings very seriously when debugging
- If someone was sloppy enough to make style mistakes, they probably made more serious mistakes as well
- Use them when reading code
- Many tools can reformat code so that it obeys style guidelines
- Do not reformat code just because you can
- Makes it difficult to track and merge changes in version control
- But reformatting a copy to make it easier to read can save you headaches
- Comment on this slide
What About Documentation?
- In an ideal world, all code would be accurately documented
- And the Beatles would still be performing as a group…
- In this world, documentation is often missing
- Or worse, it may have been up to date three years ago, but no longer describes the software correctly
- Many kinds of documentation are useful
- Embedded in code to explain tricky bits
- “This is a while loop, instead of a for, because it may delete items from the list as it goes”
- Associated with class/method headers to explain what they're for
- Documentation for a library
- “This function returns the first pair of non-overlapping subsequences that match the input pattern, or null otherwise”
- Architectural descriptions
- The architecture of a program is what you draw on the whiteboard when explaining its internals to other people
- May show relationships between classes, major modules, data flow, etc.
- Often the most important documentation of all, since it's what gives other programmers a mental map of how everything fits together
- Requirements
- Description of what needs the software is supposed to meet
- “If more than 100 events arrive in a second, the seismograph interface must store them in a queue“ tells you to look for a seismograph interface, and a queue that feeds it data
- User guide
- As above, knowing what the software is supposed to do often helps figure out how it works
- It's increasingly fashionable to embed some documentation in code, then extract and format it
- Javadoc (for Java) translates specially-formatted comments into HTML
| Java | HTML |
|---|
/**
* Returns the least common ancestor of two species based on DNA
* comparison, with certainty no less than the specified threshold.
* Note that getConcestor(X, X, t) returns X for any threshold.
*
* @param left one of the base species for the search
* @param right the other base species for the search
* @param threshold the degree of certainty required
* @return the common ancestor, or null if none is found
* @see Species
*/
public Species getConcestor(Species left, Species right, float threshold) {
...implementation...
}
| <p><strong>getConcestor</strong></p>
<p><code>public Species getConcestor(Species left, Species right, float threshold)</code></p>
<blockquote>
<p>Returns the least common ancestor of two species based on DNA
comparison, with certainty no less than the specified threshold. Note
that getConcestor(X, X, t) returns X for any threshold.</p>
<p><strong>Parameters:</strong></p>
<blockquote>
<p><code>left</code> - one of the base species for the search
<p><code>right</code> - the other base species for the search
<p><code>threshold</code> - the degree of certainty required
</blockquote>
<p><strong>Parameters:</strong></p>
<blockquote>
<p>the common ancestor, or null if none is found</p>
</blockquote>
<p><strong>See Also:</strong></p>
<blockquote>
<p><code>Image</code></p>
</blockquote>
</blockquote>
|
- Python's
PyDoc module does similar things
- Embedded documentation is more likely to be up to date than external documentation
- It's right in front of programmers as they make changes
- Although that's no guarantee…
- Comment on this slide
Traceability
- Reproducibility is one of the things that makes science science
- Anyone can (at least in principle) repeat your experiment to verify your results
- Traceability is a key part of this in computational science
- What program(s) did you use?
- Where did your data come from, and what did you do to it?
- Always include version control information in source files
- Subversion can be told to look for strings like
$Revision: 421$ when you submit changes, and replace the text after the colon with the appropriate information - Allows the file to carry its identity with it when it's printed, archived, or emailed
- Usually put the keyword inside a string, so that version information is available inside the program
__version__ = "$Revision: 423 $"
if __name__ == '__main__':
print __version__
- Data files are source files too!
- Every original data file should go under version control, and have revision information stored in it
- Can record author and date of last change as well as version number
- Carry this information through when processing files
- E.g., input file
biomes.dat has a header $Revision: 421$ - Is processed by version 37 of
ecoanalyzer.py - So include the following header (as comments) in the output file:
# $Revision: $
# From: biomes.dat version 421
# By: ecoanalyzer.py version 37
# On: 2005-09-22 12:14:07 EST
- Comment on this slide
Executable Documentation
- Also increasingly fashionable to talk about executable documentation
- Things that describe the program in ways that the computer can check
- Assertions
- Assertions at the top of a method are usually checking pre-conditions that must be true for the method to run properly
- Those at the bottom are guarantees of what will be true after the method has run
- Unless there are
return statements embedded in the middle of the method… - …which is another reason they're a bad idea
- Assertions in the middle of methods are usually checking:
- Invariants (i.e., number of items in the
searched list, plus the number in the toSearch list, stays constant) - Old bugs: common to introduce assertions while debugging, then leave them in to detect the bug's reappearance
- When reading code, assertions let you compare your current understanding of the code with what it actually does
- If you see
assert x < y, and you don't know why x has to be less than y at that point, you've lost track
- So when writing code, add assertions:
- To detect errors and mis-use early
- To help the next person understand how the code is supposed to behave
- Tests
- Each test is really just a super-assertion
- Each test that's up to date and runs correctly, that is
- They are not a substitute for documentation
- Which is easier to read, 23 test cases, or one sentence explaining the relationship between a method's three Boolean input parameters?
- But they are a definitive answer in cases of ambiguity
- When writing, use test-driven development to ensure that tests exist, and are kept up to date as code changes
- When reading, look for tests that use the classes or functions you're interested in, and make sure you understand why the given input produces the checked-for output
- Build
- Every medium to large project has infrastructure to compile source files, create installers, run tests, etc.
- Knowing how the software is built can tell you a lot about what to look for
- Entire directories of source code may be ignored depending on the operating system you're on
- Different modules may be used depending on whether it's the Lite or Professional version
- Build control files are notoriously difficult to read
- Often easier to perform a build, and look at its output to see what files are involved, what flags are used, etc.
- When writing code, try to keep the build as simple as possible
- If any parts of the software are only conditionally included, make the conditions very prominent
- If you're linking non-standard libraries, make sure they're flagged too
- Comment on this slide
Active Reading
- Sometimes, the only way to understand a piece of code is to get your hands dirty
- Logging
- Running in a debugger
- Dynamic code inspection tools
- Most languages have logging libraries that let programmers control debugging output
- Typical call is
log("GRID", LOG_DEBUG, "Refining a %d X %d grid", dimX, dimY) - First argument specifies topic (sometimes called “channel”)
- Allows readers to filter messages (e.g., select only those related to grid updates)
- Second specifies level: only print message if user has asked for that much detail
| Level | Meaning |
|---|
| DEBUG | Debugging output (of interest to programmers only) |
| INFO | Normal operations (e.g., a user logging in) |
| WARN | Something that requires an administrator's attention (e.g., a user is out of disk space) |
| ERROR | Something wrong with the program that it can recover from (e.g., unable to open its own configuration file) |
| CRITICAL | System about to crash, missiles about to launch, yet another bad 1970s TV show being turned into a movie |
Table 15.1: Typical Logging Levels
- At start of program, specify where to send logging output
- To screen
- To a file
- To a file that “rolls over” when it reaches a certain size
- To a database
- Through a socket to a collector on some other machine
- Etc.
- When writing code, add logging statements to record important events
- Set to “no output” by default to avoid annoying users
- When program starts to behave strangely, have users turn on logging and send you the trace
- Encourages cleaner coding
- If there's no point at which you can print, “Rotation complete”, it'll be hard for readers to figure out which code is responsible for doing rotations
- Debuggers aren't just for finding bugs
- Stepping through code is often the easiest way to figure out how it works
- First pass: step over functions, rather than into them
- Then re-run, and dig down to the next level of detail
- Hardest part is often figuring out where to start
- Depends on the architecture
- Command-line applications: start with
main (or, in Python, __name__ == '__main__') - If the code plugs into a larger framework, set breakpoints on key methods, then examine the call stack to see how you got there
- Some debuggers are more graphical than others
- Data Display Debugger (DDD) can draw box-and-arrow diagrams of data structures
![[Data Display Debugger]](img/readstyle/ddd.jpg)
Figure 15.7: Data Display Debugger
(Courtesy of Andreas Zeller, The DDD Manual)
|
- Integrated Development Environments
- A good IDE is more than just a fancy editor and a symbolic debugger
- Good ones (like Eclipse) also include code browsers
- Keep track of where classes, methods, and variables are defined
- Single click to jump from a variable to its definition, then to the definition of its class
- Another click to search for variables of a certain class (or its subclasses)
- And much, much more
- Similar tools available for old-fashioned text editors
- Ctags parses source files and produces a database of the definitions they contain
- Single editor command jumps from use to definition
- Currently works for over 30 languages
- Comment on this slide
Summary
- The most efficient way to read code is almost always to:
- Print it out
- Paper has 4-10 times the resolution of even the best screen
- Sit somewhere comfortable
- Away from email, the web, your pager, and other distractions
- Find an entry point (like
main) - Trace execution from there, skipping over stuff that looks straightforward
- E.g., argument parsing and file I/O
- Put a question mark beside everything that doesn't immediately make sense
- Once you're done, go back and try to answer your questions
- This also happens to be a good way to do a code review
- Read a colleague's code in order to:
- Find bugs
- Make sure she is following style guidelines
- Make sure she implemented what you asked for
- Code reviews are a more effective way to find bugs than running ad hoc tests
- Which is yet another good reason to learn how to read code
- Comment on this slide
Watching Programs Run
Turing's Great Insight
- Alan Turing's great insight was that programs are just another kind of data
- Source code is text that can be parsed line by line, using regular expressions, etc.
- Compiled programs are just bytes in memory
- Reflection is the ability of software to examine itself as it runs
- What methods does this class contain?
- What lines of code have been executed?
- How much time is being spent executing each line of code?
- It may sound very abstract, but:
- Every modern programming language provides it (to some extent)
- Many important tools rely on it
- A growing number of applications use it to make themselves more modular and easier to extend
- Will briefly describe features of other languages in order to explain things
- Trust me, it's not as scary as it first appears
- Comment on this slide
Faking Objects
- If Python didn't have objects and classes, you could fake them using its other features
- Seeing how will give you a clearer picture of how Python's own objects work
- Use a dictionary to store an object's members
- Including its methods…
- …which are just functions that take the object as their first argument
- So for each class, write a function that creates such a dictionary
def makeRect(x0, y0, x1, y1):
'''Construct a rectangle 'object'.'''
return {'x0' : x0,
'y0' : y0,
'x1' : x1,
'y1' : y1,
'area' : rectArea,
'contains' : rectContains,
'str' : rectStr
}
- Now write the “methods”
- Remember, each one will be passed the object as its first argument
def rectArea(obj):
'''Calculate area of rectangle.'''
return (obj['x1'] - obj['x0']) * (obj['y1'] - obj['y0'])
def rectContains(obj, x, y):
'''See if a rectangle contains an XY point.'''
return (obj['x0'] <= x <= obj['x1']) and \
(obj['y0'] <= y <= obj['y1'])
def rectStr(obj):
'''Show as string.'''
return '[%(x0)d,%(y0)d X %(x1)d,%(y1)d]' % obj
- Write a function to run an arbitrary method
- Look for it in the object's dictionary
- Pass it the object, and any other parameters given by the user
- Hand the result back to the original caller
def run(obj, meth, *args, **kwargs):
'''Run a method for an 'object', passing in arbitrary arguments.'''
func = obj[meth]
return func(obj, *args, **kwargs)
- Note: this function works for all “classes”
- Finally, write some tests to show how it works
# Create a rectangle.
block = makeRect(0, 0, 2, 2)
print run(block, 'str')
# Calculate its area.
print 'area is', run(block, 'area')
# See if it contains a couple of points.
print 'contains (1, 1)?', run(block, 'contains', 1, 1)
print 'contains (10, 10)?', run(block, 'contains', 10, 10)
- This is actually not very far from what Python actually does
- Replace
run(obj, 'method', …) with obj.method(…) - Let programmers define classes to group related functions together
- I.e., to make it clear that they're methods
- Classes also help implement inheritance
- Instances of a child class get all the methods of their parent class(es)
- If the child class redefines the method, the object gets a reference to the new one, rather than the old one
- Comment on this slide
How Other Languages Do It
- Java's object model is more restrictive than Python's
- All methods and members must be defined when code is compiled
- Can't be modified at runtime
- Less flexible, but makes automatic optimization easier
- Object stores its own data members, and a reference to the class it belongs to
- At runtime, the class is just another object
- Its members are its methods
obj.meth(args) is obj.class[methodname](args)
- I.e., look up the class, look in it to find the method, and call that
- C++, Fortran-90, and other languages are all variations on the theme
- Key idea: it's all functions underneath
- Object orientation is “just” a way to figure out which one to call
- Comment on this slide
Runtime Tricks
- So, what's all this good for?
- Well, how do you think a debugger works?
- Doesn't know anything about your code in particular
- So it uses reflection to look things up as it needs them
- Still useful even if you don't intend to write a debugger
hasattr(obj, attrName) reports whether an object has an attribute with a particular namegetattr(obj, attrName) gets the value of an object's attributesetattr(obj, attrName, newValue) set an attribute's value- Yes, you can add attributes to particular objects after they have been created
dir(obj) lists all the attribute of an object- All of these are very useful when debugging interactively
- Python's
import statement is evaluated at runtime as well- Unlike C++'s
#include or Java's import, which are evaluated by the compiler
- Can therefore decide at runtime which version of a class to load
- If running a production system, load the real one
- If testing, load a fake version
import sys
if (len(sys.argv) > 1) and (sys.argv[1] == '-d'):
import testLib as lib
else:
import realLib as lib
print lib.ackermann(3, 2)
- Take this one step further, and use information in a file to control what modules Python loads and calls
- Key idea: everything you load dynamically must obey some kind of protocol
- A set of rules that allow two entities to communicate without knowing anything else about each other
- “Everything reads text from standard input, and writes text to standard output” is an example
- You have to know something in order to use code…
- Example: want to apply a series of filters to a set of strings
- Just like Unix command line tools
- Protocol: every module must define a function called
func that takes a list of strings as input, and produces another list of strings as output def func(lines):
result = []
for i in range(1, len(lines), 2):
result.append(lines[i])
return result
- Load modules dynamically using the
__import__ function- Does what the
import statement does, but takes a string as an argument - Note: must not include the trailing
.py
- The whole thing is surprisingly small
import sys
filters = []
for moduleName in sys.argv[1:]:
moduleName = moduleName[:-3] # strip off '.py'
module = __import__(moduleName)
func = getattr(module, 'func')
filters.append(func)
print filters
lines = sys.stdin.readlines()
for f in filters:
lines = f(lines)
for l in lines:
print l,
- This is how systems like the Apache web server work
- Modules being loaded are written in C, rather than Python
- Which modules to load is specified in a complex configuration file, rather than on the command line
- But the principles are the same
- Makes it very easy to add new components, swap real components for debugging versions, etc.
- Can equally well use it to control what mesh and solver classes to load, or which sequence comparison tests to import
- Comment on this slide
Coverage
- As well as looking at code as a bunch of (static) bytes, we can watch what happens as it executes
- Very useful for debugging
- Even more useful for performance tuning
- When Python translates your program into bytecode, it keeps track of where instructions came from
- It can then make a record of which lines of code have been executed
- Typically accumulate data over several runs of the program
- Allows you to see which bits of your program are no longer being used
- More importantly, which bits aren't being tested
- The
coverage tool does four things:
coverage -x yourprogram.py …args… runs your program to collect coverage data- Data is accumulated in a file called
.coverage so that you can collect information over several runs of your program
- Example:
import sys
def foo(x, y):
for i in x:
y += 1
print 'result:', y
def bar(z):
if z < 0:
print "shouldn't happen"
elif z == 0:
print "shouldn't happen either"
else:
foo('abc', z)
if __name__ == '__main__':
for i in range(1, int(sys.argv[1])):
bar(i)
python coverage.py -x c.py 3
result: 4
result: 5
python coverage.py -r -m c.py
Name Stmts Exec Cover Missing
-------------------------------------
c 14 12 85% 10, 12
python coverage.py -a
> import sys
> def foo(x, y):
> for i in x:
> y += 1
> print 'result:', y
> def bar(z):
> if z < 0:
! print "shouldn't happen"
> elif z == 0:
! print "shouldn't happen either"
> else:
> foo('abc', z)
> if __name__ == '__main__':
> for i in range(1, int(sys.argv[1])):
> bar(i)
- Coverage is most useful when integrated with unit tests
- Usually not economical to aim for 100% code coverage
- Tests for simple get/set methods are as likely to be buggy as the methods themselves…
- …and some things (like the network cable being kicked out at exactly the wrong moment) really are hard to automate
- But very useful to know whether code coverage is going up or down
- If someone checks in a large change set, and code coverage immediately drops from 80% to 70%, you probably want to have a close look at those changes
- Comment on this slide
Profiling
- Rather than just flagging lines as they are executed, we could count how often each line executes
- The more often something runs, the more important it is to figure out how to make it fast
- If execution time is what we really care about, better to measure it directly
- A statement that takes 10 milliseconds to execute is ten thousand times more expensive than one than takes a microsecond
- Profiling is the act of measuring aspects of a program's performance
- Typically measure execution time
- Can also measure disk usage, memory footprint, etc.
- Two approaches
- Instrument the code by inserting instructions to record the clock time
- Typically record time at start and end of each method call, and total up the differences
- Deterministic, but expensive
- Sample the code by interrupting it at regular intervals to see where it is
- Builds a statistical profile of where in the code we're most likely to be
- Non-deterministic, but impact can be reduced by reducing sampling frequency, and running the program for longer
- Both kinds of profiling distort the program's behavior
- The Heisenberg Principle: measuring something changes it
- Python's
hotshot module measures execution time- Create an instance of
hotshot.Profile with the name of a log file as an argument- Profile data accumulates in that file
- Use the profile object's
runcall method to profile a single run of a function- Which may of course call other functions
- To process the data, use
hotshot.stats
- Can load data from log files, sort it in various ways, print it, etc.
- Example: compare speed of 1000 handwritten substring function calls with built-in
import sys
def builtin(str, substr):
return str.find(substr)
def handwritten(str, substr):
for i in range(0, len(str)-len(substr)):
match = True
for j in range(len(substr)):
if substr[j] != str[i+j]:
match = False
break
if match:
return i
return -1
def tests(num):
Tests = [
['a', ''],
['a', 'abc'],
['abc', 'a'],
['pqrstuvwxyz', 'stuv'],
['pqrstuvwxyz', 'stuw']
]
for i in range(num):
for (str, substr) in Tests:
expected = builtin(str, substr)
assert expected == handwritten(str, substr)
if __name__ == '__main__':
import hotshot, hotshot.stats
Filename = 'substr.prof'
num = int(sys.argv[1])
prof = hotshot.Profile(Filename)
prof.runcall(tests, num)
prof.close()
stats = hotshot.stats.load(Filename)
stats.strip_dirs()
stats.sort_stats('time', 'calls')
stats.print_stats()
10001 function calls in 0.292 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
5000 0.165 0.000 0.165 0.000 substr.py:6(handwritten)
1 0.081 0.081 0.292 0.292 substr.py:17(tests)
5000 0.045 0.000 0.045 0.000 substr.py:3(builtin)
0 0.000 0.000 profile:0(profiler)
- Hm…built-in method is only two and a bit times faster?
- What does it cost for an empty function call?
def builtin(str, substr):
return 0
def handwritten(str, substr):
return 0
10001 function calls in 0.166 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.098 0.098 0.166 0.166 empty.py:11(tests)
5000 0.034 0.000 0.034 0.000 empty.py:7(handwritten)
5000 0.034 0.000 0.034 0.000 empty.py:4(builtin)
0 0.000 0.000 profile:0(profiler)
- Take off overhead, and the difference is closer to 4X
- Still seems low, but at least now we have a way to investigate quantitatively
- Comment on this slide
Summary
- Major advances in programming come when languages start offering support for things the best programmers are doing anyway
- For loops and if/then/else formalized what good Fortran programmers were already doing
- Objects formalized the way good C, Pascal, and Lisp programmers managed their data structures and functions
- Java deserves credit for bringing two previously-esoteric practices into the mainstream
- Garbage collection: the computer recycles memory as it needs to
- Reflection: programs can inspect themselves at runtime
- Reflection simplifies the construction of large software systems
- Most big applications are now frameworks that load plug-in components dynamically
- A little extra effort…
- …but it forces programmers to really, truly modularize their code…
- …which also reduces maintenance and customization costs
- Watching programs run is an essential part of the software development process
- Which parts of my code can be thrown away?
- Which parts am I actually testing?
- Which parts of their work are my colleagues actually testing?
- Why is my program so slow?
- Modern computer systems are so complex that it's practically impossible to figure this out from first principles
- So write the code, profile, and then start tuning
- Comment on this slide
Exercises
Exercise 16.1:
What percentage of your code is tests? Is tested?
Exercise 16.2:
Can you honestly say that you write tests before code?
Find out how many tests currently pass or fail with a single
command? Identify the tests associated with a bug? Tell if your
code meets the team's standards?
Exercise 16.3:
Can you find out which functions use the most CPU time?
How long threads spend blocked on I/O? Who allocates memory, where,
for what? How accurate these numbers are? How today's profile
differs from last month's? How the profile differs across
machines?
Regular Expressions
Introduction
- How to count the blank lines in a file?
- Most people consider a line with just spaces and tabs to be blank
- But examining characters one by one is tedious
- More complex patterns (like telephone numbers or email addresses) are hard to get right
- Solution: use regular expressions (REs) instead
- Represent patterns as strings
- Just like the
"*" in the shell's *.txt
- Warning: the notation is ugly
- Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do
- Can't use superscripts and subscripts
- Comment on this slide
A Simple Example
- Load the
re module, then use re.search(pattern, text)
pattern is a regular expression that describes what you're looking fortext is the string you're searching in
- So what goes in
pattern?
- Letters, digits, and whitespace character match against themselves
- Punctuation characters have special meaning
| Pattern | Matches | Doesn't Match | Explanation |
|---|
⌈a*⌋ | "", "a", "aa", … | "A", "b" | ⌈*⌋ means “zero or more” matching is case sensitive |
⌈b+⌋ | "b", "bb", … | "" | ⌈+⌋ means “one or more” |
⌈ab?c⌋ | "ac", "abc" | "a", "abbc" | ⌈?⌋ means “optional” (zero or one) |
⌈[abc]⌋ | "a", "b", or "c" | "ab", "d" | ⌈[…]⌋ means “one character from a set” |
⌈[a-c]⌋ | "a", "b", or "c" | Character ranges can be abbreviated |
⌈[abc]*⌋ | "", "ac", "baabcab", … | Operators can be combined: zero or more choices from "a", "b", or "c" |
Table 17.1: Basic Regular Expression Operators
![[Matching]](img/re/option_match.png)
Figure 17.1: Matching |
re.search looks for a match anywhere in the text- Doesn't have to match the entire target string
import re
pattern = 'a[bc]*'
for text in ['b', 'ab', 'accb', 'mad']:
if re.search(pattern, text):
print '"%s" matches "%s"' % (pattern, text)
else:
print '"%s" does not match "%s"' % (pattern, text)
"a[bc]*" does not match "b"
"a[bc]*" matches "ab"
"a[bc]*" matches "accb"
"a[bc]*" matches "mad"
⌈a[bc]*⌋ matches an "a", followed by zero or more of either "b" or "c"
- Doesn't match
"b" because there's no leading "a" - Matches
"ab" and "accb"
- Why does it match
"mad"?
re.search looks for a match anywhere in text- Skips the
"m", then ⌈a⌋ matches "a", and ⌈[bc]*⌋ matches the empty string
- Comment on this slide
Anchoring
- If
re.search looks anywhere in the line, how to find blank lines?
- Don't consider
"x \n" or " x\n" blank
- Constrain what the RE can match using anchors
⌈^⌋ matches the beginning of the string⌈$⌋ matches the end- Neither consumes any characters
![[Anchoring Matches]](img/re/match_anchor.png)
Figure 17.2: Anchoring Matches |
- Examples
| Pattern | Text | Result |
|---|
⌈b+⌋ | "abbc" | Matches |
⌈^b+⌋ | "abbc" | Fails (string doesn't start with b) |
⌈c$⌋ | "abbc" | Matches (string ends with c) |
⌈^a*$⌋ | aabaa | Fails (something other than "a" between start and end of string) |
Table 17.2: Anchoring Regular Expressions
- Can now count blank lines in a file
import sys, re
# Nothing but space, tab, carriage return, newline from start to end
pattern = '^[ \t\r\n]*$'
# Count matches in one file/stream.
def count(filename, instream):
count = 0
for line in instream:
if re.search(pattern, line):
count += 1
print '%s %d' % (filename, count)
# Only standard instream?
if len(sys.argv) == 1:
count('<stdin>', sys.stdin)
else:
for filename in sys.argv[1:]:
instream = open(filename, 'r')
count(filename, instream)
instream.close()
- Note: always behave like a polite command line filter
- If no arguments given, read from standard input
- Comment on this slide
Escape Sequences
- How to match against a literal
"^" or "*"?
- Escape from the character's special meaning by putting
"\" in front of it ⌈\$⌋ matches a literal "$", and ⌈\\⌋ matches a literal "\"
- Must write these in Python as
"\\$" and "\\\\"
- Two layers of compilation:
- Python turn double backslashes into single backslash character
- Regular expression library then compiles single backslash plus something into special operation
![[Compiling Regular Expressions]](img/re/re_compile.png)
Figure 17.3: Compiling Regular Expressions |
- This can get very confusing, very quickly
"\t" is a tab character, which matches a tab character- But
"\\t" is the two-character sequence ⌈\t⌋, which also matches a tab character
-
"\" is also used in shorthand notation for common character sets| Sequence | Equivalent | Explanation |
|---|
⌈\d⌋ | ⌈[0-9]⌋ | Digits |
⌈\w⌋ | ⌈[a-zA-Z0-9_]⌋ | Word characters (i.e., those allowed in variable names) |
⌈\s⌋ | ⌈[ \t\r\n]⌋ | Whitespace |
Table 17.3: Regular Expression Escape Sequences
- A couple of useful special cases:
- The notation
⌈[^abc]⌋ means “anything except the characters in this set” ⌈.⌋ means “any character except the end of line”
⌈\b⌋ anchors the match to a break between word and non-word characters- Like
⌈^⌋ and ⌈$⌋, doesn't consume any actual characters ![[Word/Non-Word Breaks]](img/re/word_nonword_break.png)
Figure 17.4: Word/Non-Word Breaks |
- Comment on this slide
Extracting Matches
- Problem: to check for duplicate line numbers in an assembly language file
- Line numbers are the first thing on the line, terminated by a colon
load D 1
10 : sub A B
jlt A 20
- First try: search each line with a regular expression
- If it succeeds, extract the line number
import sys, re
# start of line, optional spaces, digits, more optional spaces, colon
numbered = '^\\s*\\d+\\s*:'
seen = {}
for line in sys.stdin:
if re.search(numbered, line):
num = line.split()[0]
if num in seen:
print num
else:
seen[num] = True
- But what if there is no space between the line number and the colon?
'2 :'.split() gives ['2', ':'], but '2:'.split()' gives ['2:']- Don't want to search the string to find the longest leading sequence of digits
- Need a better way to extract the text that matched sub-patterns
- Result of
re.search is actually a match object that records what what matched, and wheremo.group() returns the whole string that matched the REmo.start() and mo.end() are the indices of the match's locationimport re
text = 'abbcb'
for pattern in ['b+', 'bc*', 'b+c+']:
mo = re.search(pattern, text)
print '%s / %s => "%s" (%d, %d)' % \
(pattern, text, mo.group(), mo.start(), mo.end())
b+ / abbcb => "bb" (1, 3)
bc* / abbcb => "b" (1, 2)
b+c+ / abbcb => "bbc" (1, 4)
- Every parenthesized subexpression in the RE is a group
- Group 0 is the entire match
- Text that matched Nth parentheses (counting from left) is group N
mo.group(3) is the text that matched the third subexpression, m.start(3) is where it started
- Extracting line numbers is now easy:
import sys, re
# start of line, optional spaces, digits, more optional spaces, colon
numbered = '^\\s*(\\d+)\\s*:'
seen = {}
for line in sys.stdin:
mo = re.search(numbered, line)
if mo:
num = mo.group(1)
if num in seen:
print num
else:
seen[num] = True
- So is reversing two columns of numbers:
# optional spaces, number, required spaces, number, optional spaces
def reverse(instream, outstream):
cols = '^\\s*(\\d+)\\s+(\\d+)\s*$'
for line in instream:
mo = re.match(cols, line)
# If match, reverse numbers
if mo:
a, b = mo.group(1), mo.group(2)
print >> outstream, '%s\t%s' % (b, a)
# If no match, echo line (without adding extra newline at end)
else:
print >> outstream, line,
- Let's not forget to test it:
if __name__ == '__main__':
fixture = '''\
# Leading comment followed by blank line
10 20
30\t40\t
50
60 70 80
\t90 100
'''
expected = '''\
# Leading comment followed by blank line
20\t10
40\t30
50
60 70 80
100\t90
'''
from cStringIO import StringIO
instream = StringIO(fixture)
outstream = StringIO()
reverse(instream, outstream)
assert outstream.getvalue() == expected
- Comment on this slide
Compiling
- The regular expression library compiles patterns into a more concise form for matching
- Each regular expression becomes a finite state machine
- Library follows the arcs in the FSM as it reads characters
- Drawing FSMs is a good way to debug REs
![[Regular Expressions as Finite State Machines]](img/re/re_fsm.png)
Figure 17.5: Regular Expressions as Finite State Machines |
- You can improve a program's performance by compiling the RE once, and re-using the compiled form
- Use
re.compile(pattern) to get the compiled RE - Its methods have the same names and behavior as the functions in the
re module - E.g.,
matcher.search(text) searches text for matches to the RE that was compiled to create matcher
- Example: find all Title Case words in a document
def findAll(instream, outstream):
matcher = re.compile('\\b([A-Z][a-z]*)\\b(.*)')
for line in instream:
mo = matcher.search(line)
while mo:
print >> outstream, mo.group(1)
mo = matcher.search(mo.group(2))
- Notice how the function gets all matches:
- Pattern captures what we want in group 1, and everything else on the line in group 2
- Each time there's a match, continue the search in the remainder captured in group 2
- Tests are straightforward
if __name__ == '__main__':
fixture = '''\
This has several "Title Case" words
on Each Line (Some in parentheses).
'''
expected = '''\
This
Title
Case
Each
Line
Some
'''
from cStringIO import StringIO
instream = StringIO(fixture)
outstream = StringIO()
findAll(instream, outstream)
assert outstream.getvalue() == expected
print 'INPUT'
print fixture
print 'OUTPUT'
print expected
import re
#- start:findAll
def findAll(instream, outstream):
matcher = re.compile('\\b([A-Z][a-z]*)\\b(.*)')
for line in instream:
mo = matcher.search(line)
while mo:
print >> outstream, mo.group(1)
mo = matcher.search(mo.group(2))
#- end:findAll
# start:test
if __name__ == '__main__':
fixture = '''\
This has several "Title Case" words
on Each Line (Some in parentheses).
'''
expected = '''\
This
Title
Case
Each
Line
Some
'''
from cStringIO import StringIO
instream = StringIO(fixture)
outstream = StringIO()
findAll(instream, outstream)
assert outstream.getvalue() == expected
print 'INPUT'
print fixture
print 'OUTPUT'
print expected
# end:test
- Compiled REs have many other useful methods
- Including one that finds all matches, so you don't have to write the loop
- All are also available as top-level functions in the
re module | Method | Purpose | Example | Result |
|---|
split | Split a string on a pattern. | re.split('\\s*,\\s*', 'a, b ,c , d') | ['a', 'b', 'c', 'd'] |
findall | Find all matches for a pattern. | re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.') | ['Some', 'Title', 'Case'] |
sub | Replace matches with new text. | re.sub('\\d+', 'NUM', 'If 123 is 456') | "If NUM is NUM" |
Table 17.4: Regular Expression Object Methods
- Comment on this slide
Using REs in Other Languages
- Like sine and sorting, regular expressions are independent of language
- RE libraries exist for C/C++, Java, Perl, Ruby, MATLAB, …
- Syntax varies slightly, but the ideas are the same
- Example: Java's
java.util.regex package contains two classes:
Pattern: a compiled regular expressionMatcher: the result of a match- Typical usage:
public static String matchMiddle(String data) {
String result = null;
Pattern p = Pattern.compile("a(b|c)d");
Matcher m = p.matcher(data);
if (m.matches()) {
result = m.group(1);
}
return result;
}
- REs are actually built into languages like Perl and Ruby
- Just as dictionaries are built into Python, and arrays into MATLAB
- Typical Perl usage:
open MAIL, 'mail.txt'
while (<MAIL>) {
if (($name, $value) = /^([^:]+): ?(.+)$/) {
print "Message header $name is $value\n";
}
}
- Comment on this slide
But Wait, There's More
- We've only scratched the surface
- Regular expressions have proved to be too useful to remain clean and elegant
- Use
⌈|⌋ for either/or⌈ab|cd⌋ matches either "ab" or "cd"⌈a(b|c)d⌋ matches either "abd" or "acd"
- Use
⌈pat{N}⌋ to match exactly N occurrences of a pattern- More generally,
⌈pat{M,N}⌋ matches between M and N occurrences ⌈\d{2,3}⌋ matches "19" or "207", but not "3" or "4567"
- Well, OK, it will match the first three characters of
"456" - But
⌈^\d{2,3}⌋ won't
- Most important thing is to build up complex REs one step at a time
- Write something that matches part of what you're looking for
- Test it
- Add to it
- That's why they call it computer science: it's experimental
- For a broader tutorial, see
[Wilson 2005]
- Comment on this slide
Exercises
Exercise 17.1:
By default, regular expression matches are
greedy: the first term in the RE
matches as much as it can, then the second part, and so on. As a
result, if you apply the RE ⌈X(.*)X(.*)⌋ to the string
"XaX and XbX", the first group will contain "aX and Xb",
and the second group will be empty.
It's also possible to make REs match
reluctantly, i.e., to have the
parts match as little as possible, rather than as much. Find out
how to do this, and then modify the RE in the previous paragraph
so that the first group winds up containing "a", and the
second group " and XbX".
Basic XML and XHTML
Overview
- XML is quickly becoming the standard way to store data
- Web pages
- Spreadsheets
- Images
- Astronomical observations
- Bewildering variety of tools for dealing with it
- And more appearing every day
- Even simple things can be tricky to do right
- Generally agreed that standards are more complex than they should have been…
- …but no agreement on which bits should have been left out
- Examine:
- The rules for creating legal XML
- The XML-compliant dialect of HTML (for web pages)
- The standard Python library for manipulating XML
- Note: there are lots of others, each with its strengths
- Reading:
- Comment on this slide
History
- 1969-1986: Standard Generalized Markup Language (SGML)
- Developed by Charles Goldfarb and others at IBM
- A way of adding information to medical and legal documents so that computers could process them
<person role="litigant">
<given-name>Charles</given-name>
<surname>Babbage</surname>
</person>
- Allows computers to reliably find cases in which Charles Babbage sued someone
- Large (500-page spec) and complex
- 1989: Tim Berners-Lee creates HyperText Markup Language (HTML) for the World Wide Web
- Much (much) simpler than SGML
- Anyone could write it, so everyone did
- Problem: HTML had a small, fixed set of tags
- Everyone wanted to add new ones
- Solution: create a standard way to define a set of tags, and the relationships between them
- 1998: first version of XML is standardized
- A set of rules for defining markup languages
- Much more complex than HTML…
- …but still much simpler than SGML
- New version of HTML called XHTML was also defined
- Like HTML, but obeys all XML rules
- Still a lot of non-XML compliant HTML out there, though
- Comment on this slide
Formatting Rules
- For our purposes, an XML document contains elements and text
- Full spec allows for external entity references, processing instructions, and other fun
- Elements are shown using tags
- Must be enclosed in angle brackets
"<>" - Full form:
<tagname>…</tagname> - Short form (if the element doesn't contain anything):
<tagname/>
- Elements must be properly nested
- If Y starts inside X, Y must end before X ends
- So
<X>…<Y>…</Y></X> is legal… - …but
<X>…<Y>…</X></Y> is not
- Every document must have a single root element
- I.e., a single element must enclose everything else
- So the following is not a legal document
<first>
This document is <em>illegal</em>
</first>
<second>
because it does not have a unique root element.
</second>
- Text is normal printable text
- Must use escape sequences to represent
"<" and ">" - In XML, written
&name;
| Sequence | Character |
|---|
< | < |
> | > |
" | " |
& | & |
Table 18.1: Common XML Escape Sequences
- Specific dialects of XML may or may not restrict which elements can appear inside which others, and where text can appear
- XHTML is very liberal
- MathML (Mathematical Markup Language) is stricter
- Comment on this slide
XHTML
- Most common use of XML is still XHTML (the XML version of hypertext)
- Basic tags:
html | Root element of entire HTML document. |
|---|
body | Body of page (i.e., visible content). |
|---|
h1 | Top-level heading. Use h2, h3, etc. for second- and third-level headings. |
|---|
p | Paragraph. |
|---|
em | Emphasized text; browser or editor will usually display it in italics. |
|---|
address | Address of document author (also usually displayed in italics). |
|---|
Table 18.2: Common XHTML Tags
- Note: XHTML includes both semantics (“What does this mean?”) and display (“How should this be drawn?”)
h1 (level-1 heading) is semantic, i (italics) is display- Now generally considered a bad thing
- Documents should only contain semantics
- Display of that semantics should be specified separately…
- …so that different browsers (or devices) can do it differently
- Sample document:
<html>
<body>
<h1>Software Carpentry</h1>
<p>This course will introduce <em>essential software development skills</em>,
and show where and how they should be applied.</p>
<address>Greg Wilson (gvwilson@third-bit.com)</address>
</html>
|
![[Simple Page Rendered by Firefox]](img/xml/simple_page_firefox.png)
Figure 18.1: Simple Page Rendered by Firefox |
|
![[Simple Page Rendered by Internet Explorer]](img/xml/simple_page_ie.png)
Figure 18.2: Simple Page Rendered by Internet Explorer |
|
- Comment on this slide
Attributes
- Elements may also have attributes
- Each attribute is a name/value pair that provides extra information about the element
- Enclosed in the opening tag
<h1>A Centered Heading</h1><p>This planet provided as-is.</p>
- Each name may appear at most once
- Like keys in a dictionary
<p>…</p> is illegal
- Values must be quoted
- Old-style HTML often allowed things like
<p>…</p>, but modern parsers will reject it - Must use escape sequences for angle brackets, quotes, etc. inside values
- Strictly speaking, attributes are redundant
- Can always re-write XML using only elements
- Usually more typing…
- …but that only matters if you're creating the XML by hand…
- …and an increasing amount is created by machines, for machines
| With Attributes | Without Attributes |
|---|
<a b="c">
<d e="f"/>
</a>
|
<a>
<a-b>c</a-b>
<d><d-e>f</d-e></d>
</a>
|
- You should use attributes when:
- Each value can occur at most once for any element.
- The order of the values doesn't matter.
- Those values have no internal structure.
- If you have to do any significant work on an attribute's value to figure out what it means, you should use an element instead.
- Comment on this slide
More XHTML Tags
- Well-written HTML pages have a
head element as well as a body
- Contains metadata about the page
| Element | Example | Purpose |
|---|
title | <title/> | Page title (for display in title bar of browser, bookmarks, etc.) |
meta | <meta/> | Information about the document (typically, to help with search) |
Table 18.3: More XHTML Tags
- Well-written pages also use comments (just like code)
- Introduce with
<!--, and end with --> <html>
<head>
<title>Comments Page</title>
<meta name="author" content="aturing"/>
</head>
<body>
<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>
<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>
</body>
</html>
- Many other tags can be used (and abused) in HTML pages
- Use
ul for an unordered (bulleted) list, and ol for an ordered (numbered) one- Each list item is wrapped in
li
- Use
table for tables- Each row is wrapped in
tr (for “table row”) - Within each row, column items are wrapped in
td (for “table data”) - Note: tables are often used to force multi-column layout, as well as for tabular data
<html>
<head>
<title>Lists and Tables</title>
<meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>
<table cellpadding="3" border="1">
<tr>
<td align="center"><em>Unordered List</em></td>
<td align="center"><em>Ordered List</em></td>
</tr>
<tr>
<td align="left" valign="top">
<ul>
<li>Hydrogen</li>
<li>Lithium</li>
<li>Sodium</li>
<li>Potassium</li>
<li>Rubidium</li>
<li>Cesium</li>
<li>Francium</li>
</ul>
</td>
<td align="left" valign="top">
<ol>
<li>Helium</li>
<li>Neon</li>
<li>Argon</li>
<li>Krypton</li>
<li>Xenon</li>
<li>Radon</li>
</ol>
</td>
</tr>
</table>
</body>
</html>
| ![[Lists and Tables]](img/xml/list_table_firefox.png)
Figure 18.3: Lists and Tables |
|
- Note how Subversion keywords have been put in
meta elements in document head- Automatically updated each time the document is committed to version control
- Comment on this slide
Connecting to Other Data
- How to put an image in a page?
- XML documents can only contain text, so you can't store an image or audio clip directly in a page
- Unless you encode it as text
- Usual solution is to store a reference to the external file using the
img tag- The
src argument specifies where to find the image file
<html>
<head>
<title>Images</title>
<meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>
<h1>Our Logo</h1>
<img src="../../img/swc_logo.jpg" alt="[Software Carpentry Logo]"/>
</body>
</html>
| ![[Images in Pages]](img/xml/image.png)
Figure 18.4: Images in Pages |
|
- Important to always use the
alt attribute to specify alternative text- Screen readers for people with visual handicaps use this instead of the image
- And it's good documentation for search engines
- Often, the “other data” you want to connect to is other HTML pages
- This is what makes it hypertext…
- Use the
a element to create a link- The text inside the element is displayed and (usually) underlined for clicking
- The
href attribute specifies what the link is pointing at - Both local filenames and URLs are supported
<html>
<head>
<title>Links</title>
<meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>
<h1>A Few of My Favorite Places</h1>
<ul>
<li><a href="http://www.google.com">Google</a></li>
<li><a href="http://www.python.org">Python</a></li>
<li><a href="http://www.nature.com/index.html">Nature Online</a></li>
<li>Examples in this lecture:
<ul>
<li><a href="comments.html">Comments</a></li>
<li><a href="image.html">Images</a></li>
<li><a href="list_table.html">Lists and Tables</a></li>
</ul>
</li>
</ul>
</body>
</html>
| ![[Links in Pages]](img/xml/links.png)
Figure 18.5: Links in Pages |
|
- Comment on this slide
Accessibility
- The web is not a particularly friendly place if you're visually disabled
- Screenreaders have a hard time dealing with web pages that use graphics instead of text for buttons…
Top 10 Accessible Web Authoring Practices describes what you should do to make your pages more accessible- All of these things help search engines and other automatic tools as well
- Comment on this slide
The Document Object Model
- The Document Object Model (DOM) is a cross-language standard for representing XML documents as trees
- Elements, attributes, and text all represented as objects
- Strengths:
- Much easier to manipulate trees than strings
- Same basic model in many different languages (which lowers the learning cost)
- Weaknesses:
- Needs a lot of memory for large documents
- Its generic model doesn't take advantage of the more advanced features of some languages
- Most popular alternative is SAX (the Simple API for XML)
- Turns an XML document into a stream of events
- “Element, element, text, element, text…
- Easy to do very simple things…
- …but anything complex requires the programmer to reimplement a subset of DOM
- Python comes with a simple implementation of DOM called
minidom
- Fast, sturdy, and well documented…
- …if you understand all the terminology, and know more or less what you're looking for)
- Comment on this slide
The Basics
- Every DOM tree has a single root representing the document as a whole
- Doesn't correspond to anything that's actually in the document
- This element has a single child, which is the root node of the document
- This node, and other element nodes, may have three types of children:
- Other elements
- Text nodes
- Attribute nodes
- Every node keeps track of what its parent is
- Allows programs to search up the tree, as well as down
- Example:
| XML | DOM Tree |
|---|
<root>
<first>element</first>
<second attr="value">element</second>
<third-element/>
</root> | ![[A DOM Tree]](img/xml/dom_tree.png)
Figure 18.6: A DOM Tree |
|
|---|
- Note: it's common to forget that text and attributes are stored in nodes of their own
- Some other Python implementations of DOM don't bother
- Make simple things simpler…
- …but only a little bit
- Comment on this slide
Creating a Tree
- Usual way to create a DOM tree is to parse a file
- If this is the file:
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
- Parse and print like this:
import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
- The
toxml method can be called on the document, or on any element node- Note that we specify
"utf-8" as the character encoding - DOM trees always store text as Unicode, so when you're converting the tree to text, you must tell the library how to represent characters
- Can also create a tree by parsing a string
- Works just like parsing a file
import xml.dom.minidom
src = '''<planet name="Venus">
<period units="days">224.7</period>
</planet>'''
doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
<period units="days">224.7</period>
</planet>
- Finally, can build a tree by hand
import xml.dom.minidom
impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')
period = doc.createElement('period')
root.appendChild(period)
text = doc.createTextNode('686.98')
period.appendChild(text)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mars"><period>686.98</period></planet>
xml.dom.minidom is really just a wrapper around other platform-specific XML libraries- Have to reach inside it and get the underlying implementation object to create the
document node - That node then knows how to create other elements in the document
- Library explains what the first and third arguments to
createDocument are - Middle one tells
createDocument what type of element the document's root node should be
- Set attributes of element nodes using
setAttribute(attributeName, newValue)
- Remember, all attribute values are strings
- If you want to store an integer or a Boolean, you have to convert it yourself
- Add new nodes to existing ones by:
- Asking the document to create the node
- Appending it to a node that's already part of the tree
- Notice that the output of the preceding example wasn't nicely indented
- We didn't tell DOM to create text nodes containing carriage returns and blanks
- Most machine-generated XML doesn't
- Comment on this slide
Walking a Tree
- Often want to visit each node in the tree
- E.g., print an outline of the document showing element nesting
- Simplest way is to write a recursive function
import xml.dom.minidom
src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''
def walkTree(currentNode, indent=0):
spaces = ' ' * indent
if currentNode.nodeType == currentNode.TEXT_NODE:
print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
else:
print spaces + currentNode.tagName
for child in currentNode.childNodes:
walkTree(child, indent+1)
doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)
solarsystem
TEXT (1)
planet
period
TEXT (5)
TEXT (1)
planet
period
TEXT (5)
TEXT (1)
planet
period
TEXT (6)
TEXT (1)
- Node's type is stored in a member variable called
nodeType
ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
- If a node is an element, its children are stored in a list called
childNodes
- A read-only structure
- See how to add, delete, and move children in a moment
- If a node is a text node, the text is in the member
data
- The single-character text nodes are the carriage returns separating elements
- Traversing a tree like this is just one of many recurring patterns in object-oriented programming
- The Visitor pattern is used to separate traversal of a data structure from operations on its elements
- One class traverses a particular kind of structure the same way each time
- User then defines the operation
- Derive a class, pass a function as an argument, etc.
- The fact that this can be done in several different ways is what makes it a pattern
- Step 1: define generic behavior
class Visitor(object):
def __init__(self):
pass
def visit(self, node):
# When given the document, skip to the root.
if node.nodeType == node.DOCUMENT_NODE:
self.visit(node.documentElement)
return
# Handle other types of nodes.
self.before(node)
self.at(node)
if node.nodeType == node.ELEMENT_NODE:
for child in node.childNodes:
self.visit(child)
self.after(node)
def doNothing(self, node):
pass
before = doNothing
at = doNothing
after = doNothing
- Users call
Visitor.visit with the root node of the tree they want to traverse - Override one or more of the three do-nothing methods to perform actions before, at, or after nodes
- Step 2: derive a class that does something useful (like count how many nodes are in the tree)
class Counter(Visitor):
def __init__(self):
Visitor.__init__(self)
self.count = 0
def at(self, node):
if node.nodeType == node.ELEMENT_NODE:
self.count += 1
- Initialize
count to zero before traversing - Increment the count at each element node
- Step 3: test
if __name__ == '__main__':
src = '<a><b>c</b><d>e</d><f>g<h/>i</f></a>'
tree = xml.dom.minidom.parseString(src)
c = Counter()
c.visit(tree)
assert c.count == 5
- Comment on this slide
Modifying the Tree
- Modifying trees in place is a little bit tricky
- Helps to draw lots of pictures
- Example: want to emphasize the first word of each paragraph
- Get the text node below the paragraph
- Take off the first word
- Insert a new
<em/> element whose only child is a text node containing that word ![[Modifying the DOM Tree]](img/xml/modify_tree.png)
Figure 18.7: Modifying the DOM Tree |
- Ah, but it's not that simple
- What if the first child of the paragraph already has some markup around it?
- E.g., what if the paragraph starts with a link?
- Could just wrap the first child with
<em/>
- But if (for example) the link contains several words, this will look wrong
- We'll ignore this problem for now
- First part of solution: find all the paragraphs using
getElementsByTagName, and iterate over them- You'll use this method a lot…
def emphasize(doc):
paragraphs = doc.getElementsByTagName('p')
for para in paragraphs:
first = para.firstChild
if first.nodeType == first.TEXT_NODE:
emphasizeText(doc, para, first)
- Second part: break the paragraph text into pieces, and handle each piece in turn
- Create a new node for each piece
- Push it onto the front of the paragraph's child list
- Once they've all been handled, get rid of the original text node
def emphasizeText(doc, para, textNode):
# Look for optional spaces, a word, and the rest of the paragraph.
m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
if not m:
return
leadingSpace, firstWord, restOfText = m.groups()
if not firstWord:
return
# If there's text after the first word, re-save it.
if restOfText:
restOfText = doc.createTextNode(restOfText)
para.insertBefore(restOfText, para.firstChild)
# Emphasize the first word.
emph = doc.createElement('em')
emph.appendChild(doc.createTextNode(firstWord))
para.insertBefore(emph, para.firstChild)
# If there's leading space, re-save it.
if leadingSpace:
leadingSpace = doc.createTextNode(leadingSpace)
para.insertBefore(leadingSpace, para.firstChild)
# Get rid of the original text.
para.removeChild(textNode)
- Third part: test it
- Yes, it really is part of the program
if __name__ == '__main__':
src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''
doc = xml.dom.minidom.parseString(src)
emphasize(doc)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>
- Comment on this slide
Summary
- There's a lot of hype in hypertext
- Haven't yet heard anyone claim that XML will cure the common cold, but I'm sure it's been said
- Strengths:
- One set of rules for people to learn
- One parser can handle all of their data
- At least, the low-level syntactic bits—still need to figure out what all those tags mean
- Weaknesses:
- Raw XML is hard to read
- Particularly if it has been generated by a machine
- A lot of data isn't actually trees
- When storing a 2D matrix or a table, you have to organize data by row or by column…
- …either of which makes the other hard to access
- There are a lot of complications and subtleties
- Most applications ignore most of them
- Which means that they fail (usually badly) when confronted with something outside the subset they understand
- Like Inglish speling, it's here to stay
- Comment on this slide
A Mini-Project
Eating Your Own Cooking
- Been talking since the first lecture about the importance of tools, and about building your own to automate repetitive tasks
- This lecture takes a look at some of the tools used to manage the raw material for this course
- Show technologies like XML and regular expressions in action
- Show how to design something simple, and grow it over time
- Starting point: lecture slides are written in a simple XML format
- Root of each lecture is a
<lec/> element with title and id attributes - May contain one or more
<topic/> elements- Must have
title attribute - May optionally have a
summary attribute (used to construct the syllabus)
- Topics contain one or more
<slide/> elements- These contain
<b1/> (for “bullet level 1”), which contain <b2/>, and so on - May also contain tables, images, and code fragments
- Our task is to validate these files to make sure that:
- They contain only printable characters
- Tabs haven't been used for indentation
- The ID in the root
<lec/> element matches the filename - All the external files the lecture references (such as images and sample code) exist
- Solution: write a command-line utility that:
- Takes a list of filenames, along with
- Some command-line flags specifying which checks to run (or omit), and
- Reports any problems
- You probably won't ever need to do this for lecture slides…
- …but at some point, you probably will want to check the integrity of data files, experimental results, etc.
- Comment on this slide
Checking for Tabs
- Start with the simplest task: checking for tabs in files
- Open each file in turn
- Check each line for tabs
- Note: could also read entire contents of file into a string, and search for tabs in that
- If any found, print message
#!/usr/bin/env python
'''Check for tabs in one or more files.'''
import sys
def checkTabs(filename):
'''Look for tabs.'''
infile = open(filename, 'r')
for line in infile.readlines():
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
infile.close()
if __name__ == '__main__':
for filename in sys.argv[1:]:
checkTabs(filename)
- Great—except it only works on files
- Can't pipe things to it, because it doesn't know how to read standard input
- Implement the standard Unix convention: if no filenames provided, read from standard input
- Hm…
sys.stdin is an already-open file, not a filename - Change signature of
checkTabs to take both the filename, and an open file- Move the
open and close to the main body
#!/usr/bin/env python
'''Check for tabs in one or more files, or on standard input.'''
import sys
def checkTabs(filename, infile):
'''Look for tabs.'''
for line in infile.readlines():
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
if __name__ == '__main__':
if len(sys.argv) == 1:
checkTabs('<stdin>', sys.stdin)
else:
for filename in sys.argv[1:]:
infile = open(filename, 'r')
checkTabs(filename, infile)
infile.close()
- Great—except it doesn't report errors like missing or unreadable files
- Printing a stack trace doesn't count
- Fix by wrapping the code in an exception handler
- Only catch the kinds of exceptions we think are reasonable to expect
- Don't want the error handling to mask errors that we didn't anticipate
#!/usr/bin/env python
'''Check for tabs in one or more files, or on standard input, and
report errors.'''
import sys
def checkTabs(filename, infile):
'''Look for tabs.'''
for line in infile.readlines():
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
if __name__ == '__main__':
try:
if len(sys.argv) == 1:
checkTabs('<stdin>', sys.stdin)
else:
for filename in sys.argv[1:]:
infile = open(filename, 'r')
checkTabs(filename, infile)
infile.close()
except IOError, e:
print >> sys.stderr, e
- Note: could equally well put the exception handler:
- Inside the
else (since we don't think I/O errors can happen while reading standard input) - Inside the
for (so that if an error occurs while reading one file, the program continues on to the next)
- Comment on this slide
Running Tools
- Now, how to run the validation tool?
python check_tabs.py file1 fil2 file… will work…- …but typing in a bunch of filenames every time would be annoying
- Which means that we wouldn't do it as often as we should
- And we know we're going to have other validation tools to run as well
- Put the command in a Makefile
- If it's worth doing again, it's worth automating
- Directory structure of the course:
- A root directory
lec for lecture notes (in .swc files)util for utility programs (like the validation tools)img for images- Images for
lec/xyz.swc go in the img/xyz directory
src for source codesrc/xyz holds sample files for the XYZ lecture
- Makefile runs tab checker on all
.swc files# Re-make everything used in the Software Carpentry course.
all :
@echo 'options: clean validate'
clean :
@rm -f *~ */*~ */*/*~
validate :
@python util/check_tabs.py lec/*.swc
- Note: also added a
clean target that gets rid of editor backup files ending in ~ - And a default target called
all, that lists the things the Makefile can do - Remember, the
"@" in front of the commands means, “Don't echo the command before running it”
- Comment on this slide
Checking for Printable Characters
- Next on the list: make sure that files only contain printable characters
- I.e., whitespace, alphanumeric, and punctuation
- Some editors insert “smart” (curly) quotes, automatically convert
"---" into "—", etc.
- But other editors can't display these
- So disallow them, and require authors to use XML escape sequences for anything special
- Solution: make sure every character in the file is in
string.printable
- Contains letters, digits, spaces, and common punctuation
- Easier to use this than to write our own regular expression
- Should we add this function to the existing validation program, or create a second program?
- The former lets the checking functions share the code that opens files, handles errors, etc.
- The latter makes it easier to re-use the pieces separately
- Solution: put them in the same program, and provide command-line flags to disable certain tests
- Might be philosophically purer to have the flags turn tests on…
- …but the normal case is going to be run them all
- Parse command-line options using the
getopt module- First argument is the list of command-line arguments to parse
- Not including
sys.argv[0], which is the name of the program
- Second is a string telling it what flags to look for, and whether they take arguments
"a:bcd:" means “-a and -d have an argument, -b and -c don't”
doPrintable = True
doTabs = True
settings, filenames = getopt.getopt(sys.argv[1:], 'pt')
for (opt, arg) in settings:
if opt == '-p':
doPrintable = False
elif opt == '-t':
doTabs = False
- Have to make another change to
checkTabs
- It and
checkPrintable need to process the same data - So neither can read that data in from the file
- Unless we want to open and read the file twice
- Solution: separate input, processing, and output
- Main body reads data and calls validation functions
try:
if not filenames:
lines = sys.stdin.readlines()
checkTabs('<stdin>', lines)
checkPrintable('<stdin>', lines)
else:
for filename in filenames:
infile = open(filename, 'r')
lines = infile.readlines()
infile.close()
checkTabs(filename, lines)
checkPrintable(filename, lines)
except IOError, e:
print >> sys.stderr, e
- Change to
checkTabs is easy def checkTabs(filename, lines):
'''Look for tabs.'''
for line in lines:
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
- And
checkPrintable is simple as well def checkPrintable(filename, lines):
'''Look for non-printable characters.'''
for line in lines:
for c in line:
if c not in string.printable:
print '%s contains non-printable characters' % filename
print line
break
- Comment on this slide
Checking Glossary Entries
- Course has a glossary that defines new or unusual terms
- Entries in lectures are formatted as
<d ref="immutable">Immutable</d>
<d/> for “definition”ref attribute is the term that appears in the glossary- Contained text is displayed in-line
- Glossary is structured like this:
<glossary>
<glosssec title="A">
<glossitem id="absolute_path" term="absolute path">...definition...</glossitem>
<glossitem id="abstract_data_types" term="abstract data types">...</glossitem>
<glossitem id="access_control" term="access control">...</glossitem>
...
<glossitem id="automatic_variables" term="automatic variables (in Make)">...</glossitem>
</glosssec>
...
</glossary>
- Goal: make sure every term is defined once, and only once
- Read in the glossary
- Record the item IDs
- Read a set of files, marking items as they're seen
- If an item has already been marked, report the duplication
- At the end, look for items that haven't been marked off
- Hm…what happens if we're only checking one file?
- Want to be able to suppress the check for all items being marked off
- So need two flags:
- One to specify the name of the glossary file (so we don't try to read it as a lecture file)
- One to turn off the check for all items being marked off
glossary = None
doGlossaryComplete = True
doPrintable = True
doTabs = True
settings, filenames = getopt.getopt(sys.argv[1:], 'G:gpt')
for (opt, arg) in settings:
if opt == '-G':
glossary = arg
elif opt == '-g':
doGlossaryComplete = False
elif opt == '-p':
doPrintable = False
elif opt == '-t':
doTabs = False
- Note: if no glossary file specified, don't check glossary items at all
- Now have enough logic that it's worth reorganizing the main processing loop
- If no filenames specified, set the list of filenames to
['<stdin>'] - Write a function
readFile to open and read a file- If the function name is
"<stdin>", it reads from sys.stdin - Have function function return both a list of lines, and the XML DOM tree
- We need the first to check for tabs and printable characters, and the second to look for glossary items
- Modified processing code includes checks for what to do with the glossary
try:
if glossary:
glossary = readGlossaryFile(glossary)
if not filenames:
filenames = ['<stdin>']
for filename in filenames:
lines, doc = readFile(filename)
checkTabs(filename, lines)
checkPrintable(filename, lines)
if glossary:
checkGlossary(filename, doc, glossary)
if glossary and doGlossaryComplete:
checkGlossaryComplete(glossary)
except IOError, e:
print >> sys.stderr, e
except xml.parsers.expat.ExpatError, e:
print >> sys.stderr, e
- Then write
readFile
def readFile(filename):
if filename == '<stdin>':
data = sys.stdin.read()
else:
infile = open(filename, 'r')
data = infile.read()
infile.close()
infile = cStringIO.StringIO(data)
lines = infile.readlines()
doc = xml.dom.minidom.parseString(data)
return lines, doc
- Three functions left to write
readGlossaryFile builds a dictionary whose keys are the terms defined in the glossary- Values are
None for now—see why in a moment def readGlossaryFile(filename):
if filename is None:
return None
infile = open(filename, 'r')
doc = xml.dom.minidom.parse(infile)
terms = doc.getElementsByTagName('glossitem')
result = {}
for term in terms:
t = str(term.getAttribute('id'))
result[t] = None
return result
checkGlossary processes uses of glossary terms in a single lecture file- Records the name of the file in which the term appears in the
glossary dictionary - If some other filename is already there, reports duplicate definition
def checkGlossary(filename, doc, glossary):
defns = doc.getElementsByTagName('d')
for defn in defns:
d = str(defn.getAttribute('ref'))
if d not in glossary:
print 'term %s in %s missing from glossary' % (d, filename)
elif glossary[d] is not None:
print 'term %s defined in %s and %s' % (d, filename, glossary[d])
else:
glossary[d] = filename
checkGlossaryComplete looks for glossary entries without associated filenames- I.e., terms that are in the glossary, but aren't highlighted anywhere in the lectures
def checkGlossaryComplete(glossary):
unused = []
for g in glossary:
if glossary[g] is None:
unused.append(g)
if unused:
unused.sort()
print 'unused terms'
for u in unused:
print '\t%s' % u
- Comment on this slide
Checking Cross-References
- Last task (for now): check external files
- Every code fragment and image that's referenced must exist
- Every code file and image must be referenced
- Obvious opportunities for abstraction
- Get the set of files in a directory
- Get the set of files referenced in the lecture file
- Report differences in both directions
- Follow the pattern used for checking the glossary
-I dir specifies the root directory for images- If none provided, don't check images
-C dir specifies the root directory for code fragments- Use the value of the lecture's
id attribute to determine which particular subdirectory to search- No point having a “don't bother to check completeness” option, since each lecture's files are stored separately
- Add options to
getopt.getopt string, and four more lines to the main processing loop if codeRootDir:
checkFiles(filename, doc, codeRootDir, 'code', 'src')
if imageRootDir:
checkFiles(filename, doc, imageRootDir, 'img', 'src')
- Write
checkFiles
- Construct the path to the directory
- Find out what it contains
- Subtract things like the
.svn directory - Note how we've left room for other things to be excluded?
- Find out what's used in the document
- Using a
Set automatically handles multiple references to a single source file
- Rely on set subtraction to find differences
def checkFiles(filename, doc, rootDir, eltName, attrName):
# What should we ignore?
Excludes = ['.svn']
# Find out where we're supposed to look.
docId = str(doc.documentElement.getAttribute('id'))
dir = os.path.join(rootDir, docId)
if not os.path.isdir(dir):
print >> sys.stderr, 'Missing directory: %s' % dir
return
# Find out what's there that we care about.
actual = Set(os.listdir(dir))
for e in Excludes:
actual.discard(e)
# Find what's used in the document.
elts = doc.getElementsByTagName(eltName)
referenced = Set()
for e in elts:
if e.hasAttribute(attrName):
referenced.add(str(e.getAttribute(attrName)))
# Show differences (if any).
showDiff(filename, dir, 'not found', referenced - actual)
showDiff(filename, dir, 'unused', actual - referenced)
- Then write the helper function
showDiff
def showDiff(filename, dir, title, values):
if len(values):
print '%s (for file %s and directory %s):' % (title, filename, dir)
for v in values:
print '\t%s' % v
- Comment on this slide
Summary
- It took an hour to write and debug this code
- Started with something simple
- Tested everything as I wrote it
- Found and fixed 31 errors in the lecture notes
- Now type
make clean and make validate before doing a commit- Ensures that what goes into the repository has at least some chance of being sensible
- Ensures that what other people add to the course will conform to style and usage rules
- Course materials include several other tools
- Re-run sample programs and check that the output stored in the lectures is still correct
- Translate lecture notes, glossary, and bibliography into HTML
- Extract
summary values from each <topic/> element to create HTML version of course syllabus - Etc.
-
[Clark 2004]
discusses other things you can do to automate routine project maintenance tasks
- Prevents your project materials from rusting
- Which makes those materials easier to share
- And gives you higher confidence that they're working correctly
- Gives you more time to concentrate on things that actually require human attention
- Comment on this slide
Exercises
Exercise 19.1:
What does getopt do when it encounters an argument it
doesn't recognize? Write a short program that demonstrates this
behavior, that can be run on its own without the user passing in
any command-line arguments.
Binary Data
Isn't It All 'Binary'?
- All data is stored as 1's and 0's
- But those 1's and 0's may represent:
- Characters that can be displayed as text
- Something else
- That “something else” is (misleadingly) called binary data
- Usually means “anything you can't manipulate it with a standard text editor”
- Why use it?
- Size
"10239472" is 8 bytes long- The 32-bit integer it represents is 4 bytes
"False" is 5 bytes, 0 is one bit (40:1)
- Speed
- Takes dozens of operations to add the integer represented by
"34" to the one represented by "57"
- Hardware interfaces
- Somewhere, something has to convert the electrical signal from the gas chromatograph to a readable number
- Lack of anything better
- It's possible to represent images as text (ASCII art, PostScript)
- But sound? Or movies?
- A lot of scientific data is stored in binary formats
- Important to know how to work with it
- But please, don't write code to do this unless you absolutely have to
- Good libraries exist for working with every image, sound, and video format out there
- Comment on this slide
How Numbers Are Stored
- Positive numbers stored in base-2 format
- 10012 is (1×23)+(0×22)+(0×21)+(1×20) = 9
- Could use sign-and-value for negative numbers
- 00112 would be 310, 10112 would be -310
- But then there'd be two representations of zero (0000 and 1000)
- More efficient to use two's complement
- “Roll over” when going below zero, like a car's odometer
- 11112 is -110, 11102 is -210, etc.
- 10002 is the most negative 4-bit integer, 01112 the most positive
- Asymmetric: there is one more negative number than positive
- Since there has to be room for 0 in the middle
- Comment on this slide
Bitwise Operators
- Like most languages, Python has four operators that work on bits
| Name | Symbol | Purpose | Example |
|---|
| And | & | 1 if both bits are 1, 0 otherwise | 0110 & 1010 = 0010 |
| Or | | | 1 if either bit is 1 | 0110 & 1010 = 1110 |
| Xor | ^ | 1 if the bits are different, 0 if they're the same | 0110 & 1010 = 1100 |
| Not | ~ | Flip each bit | ~0110 = 1001 |
Table 20.1: Bitwise Operators in Python
- Note: the name “xor” is short for “exclusive or”, i.e., either/or
- Common to store several independent yes-or-no flags as a set of bits
- Example: need to record whether a sample contains any mercury, phosphorus, or chlorine
- Use bit 1 for mercury, bit 2 for phosphorus, bit 3 for chlorine
![[Using Bits to Record Sets of Flags]](img/binary/bit_flags.png)
Figure 20.1: Using Bits to Record Sets of Flags |
- Define constants to test for particular elements
# hex binary
MERCURY = 0x01 # 0001
PHOSPHORUS = 0x02 # 0010
CHLORINE = 0x04 # 0100
# Sample contains mercury and chlorine
sample = MERCURY | CHLORINE
print 'sample: %04x' % sample
# Check for various elements
for (flag, name) in [[MERCURY, "mercury"],
[PHOSPHORUS, "phosphorus"],
[CHLORINE, "chlorine"]]:
if sample & flag:
print 'sample contains', name
else:
print 'sample does not contain', name
sample: 0005
sample contains mercury
sample does not contain phosphorus
sample contains chlorine
- Can use and, or, and not to set specific bits to 1 or 0
- To set the ith of
x to 1:
- Create a value
mask in which bit i is 1 and all others are 0 - Use
x = x | mask
- To set the ith of
x to 0:
- Create a value
mask in which bit i is 1 and all others are 0 - Negate it using
~, so that the ith bit is 0, and all the others are 1 - Use
x = x & mask
- Comment on this slide
Shifting
- Shifting an integer's bits left N places written as
x << N
- Each leftward shift corresponds to multiplying by 2
- Just as shifting a decimal number left corresponds to multiplying by 10
- Example: 6<<2 is 01102<<2, or 11002, which is 12
- Shifting a number right corresponds to division by 2 (throwing away the remainder)
- 710>>1 is 01112>>1, or 00112, which is 310
- Shifting is not more efficient than multiplication and division on modern computers
- What happens if the top bit changes value as a result of a shift?
- 610<<1 = 01102<<1 = 11002
- On a 4-bit machine, this is -410, not 1210
- Some machines preserve the sign bit when shifting down
- So 11002>>1 = 11102, instead of 01102
- Depends on the hardware being used
- Java provides a separate operator for this
- Comment on this slide
Floating Point
- Floating point numbers are more complicated
- A 32-bit float has:
- One bit for the sign
- 23 bits for the mantissa (or value)
- 8 bits for the exponent
![[Floating Point Representation]](img/binary/float_rep.png)
Figure 20.2: Floating Point Representation |
- This means that you can only represent a fixed set of values
- If the actual value isn't in that set, you must settle for the closest available approximation
- Consequence #1: values are unevenly spaced
- Imagine we only had 6 bits for each floating-point number (1 sign, 3 mantissa, 2 exponent)
- Means less absolute precision for numbers with larger magnitudes
![[Uneven Spacing of Floating-Point Numbers]](img/binary/uneven_spacing.png)
Figure 20.3: Uneven Spacing of Floating-Point Numbers |
- Consequence #2: ro