Welcome
- These notes are part of a course on software development skills for scientists and engineers being prepared by Greg Wilson for the Python Software Foundation
- Please use the links included with the slides to provide comments and feedback
- Comment on this slide
Course Outline
Acknowledgments
- The
Python Software Foundation, for the grant that made this work possible - The
University of Toronto, for letting me test this version of this course on its students - Its
Department of Computer Science, for giving me a home, and making me feel welcome - Brent Gorda, who helped create the first version of this course
- Heather Mayer, for the artwork
YesLogic, for generously donating licenses for Prince (their XML-to-PDF converter)- The many people who commented on this material, and suggested ways to improve it:
| Hossein Bidhendi | Stephane Bortzmeyer | Michelle Craig | Simon Duane | Paul Dubois | Hans Fangohr | Brent Gorda |
| Adam Goucher | Perry Greenfield | Paul Gries | Brandon King | Catherine Letondal | Michelle Levesque | Andy Lumsdaine |
| Laurie MacDougall | Keir Mierle | Kit-Sun Ng | Dirkjan Ochtman | Victor Putz | Irving Reid | Karen Reid |
| Paul Salvini | Diomidis Spinellis | Bill Spotz | Tom Van Vleck | Jim Vickroy | | |
- Frank Willison—I'm sorry this one was finished too late for you to tune up
- John Scopes, and everyone else with the courage to fight for the idea that the truth is more important than doctrine
- Comment on this slide
Introduction
Motivation
- Computers are as important to scientists as telescopes and test tubes
- Analyze problems that are too complex for traditional means
- Simulate things that can't be studied in laboratories
- Many scientists now spend much of their professional lives writing and maintaining software
- A quarter of graduate students in science and engineering spend 25-50% of their time programming
- But most scientists have never been taught how to do this efficiently
- It's a long way from the loops and arrays of first year to simulating bone development in foetal marsupials…
- Like being shown how to differentiate polynomials, then expected to invent the rest of calculus
- This course will teach you how to design, build, maintain, and share programs more efficiently
- Focus: tools and techniques appropriate for half a dozen people working together for a year
- Everything you do at that scale will also make you more productive when you're working on your own for a week
- Will not turn you into a computer scientist
- Far too many of them around anyway
- Instead, goal is to teach you the equivalent of good laboratory technique for computational science
- The 20% of ideas that account for 80% of real world use
- Software carpentry, rather than software engineering
- Comment on this slide
Meeting Standards
- Experimental results are only publishable if they are believed to be correct and reproducible
- Equipment calibrated, samples uncontaminated, relevant steps recorded
- In practice, almost always rely on the professionalism of the people doing the work
- How well do computational scientists meet these standards?
- Correctness of code rarely questioned
- We all know programs are buggy, but when was the last time you saw a paper rejected because of concerns over the quality of the software used to produce the results?
- Reproducibility often nonexistent
- How many people can reproduce, much less trace, each computational result in their thesis?
- The times they are a-changing
- Standards for computational work can only go up
- Change can happen almost overnight
- Like the American car market when German and Japanese imports appeared in the 1970s
- Comment on this slide
The Most Important Idea in This Course
- Improving quality improves productivity
- I.e., the tools and techniques you must adopt to produce better code also help you write more code, faster
- Comment on this slide
Who You Are
- User stories
- Important part of designing user interfaces for mass-distribution software
- Helps make discussion of features and usability more concrete
- Harry
- 27; B.Sc. in zoology
- Did an introductory Fortran course nine years ago, and attended a workshop on web-based bioinformatics tools when he started his job
- Now developing fuzzy pattern-matching algorithms for Genes'R'Us, a biotech firm with labs in four countries
- Ron
- 24; B.Eng in mechanical engineering, now doing an M.Sc. part time
- Did C in first year; has been using MATLAB ever since
- Modeling thermal degradation (a.k.a. “melting”) of firefighters' helmets
- Hermione
- 34; Ph.D. in physics
- Took two courses on C and two on numerical analysis as an undergrad, and a computer graphics course as a graduate student
- Now in charge of the 5-person flywheel braking group at Yoyodyne Inc.
- Ginny
- 22; finished a B.Eng. in chemical engineering last year, now doing an M.Sc. in biochemistry
- Did C in first year, and has built a personal web site (static HTML only)
- Thesis topic is improving the yield of organic fullerene production
- Albus
- 47; Ph.D. in mathematics; studies large random graphs using both analytic and computational techniques
- Undergraduate degree in Mathematics and Computer Science in the 1970s; programs mainly in C++
- Professor at Euphoric State University; former chair of graduate studies
- Comment on this slide
A Quick Self-Test
- Adapted from
[Spolsky 2004]
and
[McConnell 1997]
- 1 for “yes”, 0 for “no”, and -1 if you don't understand the question
- So:
- Do you use version control?
- Can you rebuild everything in one step?
- Do you have an automated test suite?
- Do you build the software, and run the test suite, daily?
- Do you have a bug database?
- Do you use a symbolic debugger?
- Do you use a style checker to ensure that your software is written in a uniform, readable way?
- Can you trace everything you release back to the software that produced it?
- Do you document as you program, and keep your documentation in your source files?
- Can you set up a development environment (including any libraries you need) on a fresh machine without heroic effort?
- Do you have a schedule with small binary milestones?
- Do you estimate how long tasks will take before you start, and compare that with how long they actually took?
- If you're working with others, or developing software that other people may use:
- Do you test your interfaces using paper prototypes before implementing them?
- Is there a searchable archive of discussions about the project?
- Do team members write and share small tools for automating common tasks?
- Does your schedule allow for infrastructure development, training, sick time, etc.?
- And your score is?
- Comment on this slide
Learn by Building
- So why are we where we are?
- It's difficult to learn these things from academic computer scientists
- CS research is more concerned with rapid prototyping than with reliability
- People are naturally sceptical of innovation
- Particularly after they've seen a few bandwagons roll through
- Glass's Law
[Glass 2002]
: any new way of doing things initially slows you down
- You only have to be as good as the competition
- American auto makers in the 1970s
- This course's approach:
- Introduce some basic tools
- Students immediately see benefit of taking the course
- Tools can be used to manage the course itself
- Show students how to build tools like these
- Where “how” includes both what goes into the software, and how to create it
- Solidifies understanding of tools' capabilities and limitations
- Makes discussion of technique more concrete
- Show students what else they can do with their new skills
- The right way to tackle issues that come up over and over again
- Key point: avoid overload
- People who already know these things tend to underestimate how hard they are to learn
- No point preaching to the converted
- Try instead to move the middle of the bell curve to the right
- Why is the course called software “carpentry”?
- Because it focuses on the details of the craft
- If software engineering is about building an electronic version of the Channel Tunnel, this stuff is the equivalent of putting an extension on the house
- Comment on this slide
Topics
- Shell programming
- Version control
- Automating builds
- Python (4 lectures)
- Systematic debugging
- Higher-level programming (2 lectures)
- Testing (2 lectures)
- Coding style and reading code
- Data crunching (4 lectures)
- Web programming and security (3 lectures)
- Software development project tools and processes (4 lectures)
- Comment on this slide
Setting Up
- Some previous programming experience
for loops, if/then/else- Function calls
- Arrays
- File I/O
- Compilation
- Individual setup
- Python (version 2.4 or higher)
- Cygwin (on Windows)
- An editor
- We'll look at smart ones later in the course
- Subversion
- We'll spend a lecture on this next week
- Time
- Expect to spend 2-3 hours outside class for each lecture
- Comment on this slide
Recommended Reading
- “If you only have time to read one book, make time to read two”
-
[Glass 2002]
summarizes what we actually know about programmers' productivity
-
[Hunt & Thomas 1999]
and
[Gunderloy 2004]
are about the things that distinguish good programmers from bad ones
-
[Lutz & Ascher 2003]
is the standard introduction to Python
-
[Langtangen 2004]
is a comprehensive introducton to Python aimed squarely at scientists and engineers
- Goes into much more detail than this course will
- But doesn't address broader issues, such as programming practices
- See the Bibliography for others
- Check out some of the Online Resources as well
- Comment on this slide
Typographic Conventions
Version Control
Problem #1: Synchronizing Files
- Want to work on one set of files on three different machines
- Option 1: use a shared file system
- Difficult to set up
- And even more difficult to make secure
- Inflexible: what if you're on the road?
- Option 2: Carry around a floppy or USB dongle
- Have to remember to copy files onto it and then off again
- What if your project is too big?
- Option 3: mail, FTP, SCP, etc.
- Still have to remember to push and pull exactly the right files at exactly the right time
- Option 4: get the computer to do the work
- Keep a master copy in one place
- Use a program to synchronize working copies when and as needed
- Comment on this slide
Problem #2: Undoing Changes
- Often want to undo changes to a file
- Start work, realize it's the wrong approach, want to get back to starting point
- Like “undo” in an editor…
- …but longer-lived—keep the whole history of every file, forever
- Similarly, often want to see who changed what, and when
- When working in teams, want to see what your partners did
- Bugs are more likely to be in fresh code than in code that's been running for a long time
- Comment on this slide
Solution: Version Control
- Solve both problems at once by using a version control system (VCS)
- Mechanics:
- Keep the master copy of every file in a central repository
- Actually, keep all old versions of every file
- Everyone on the development team does their work in a working copy
- When you're ready to share your changes, you commit them to the repository
- The VCS saves the old version of the file, then writes your changes on top
- Also records the time of the change, who made it, and a comment
- Take a look in a moment at what happens if two or more people try to make changes at the same time
- Comment on this slide
CVS and Subversion
- Two open source version control systems in widespread use
- Many others available commercially
- If you can afford it, use Perforce
- CVS (Concurrent Version System)
- Invented in the 1980s
- Very popular, but showing its age
- Flaw #1: it keeps track of each file separately
- But authors often change several files in tandem
- Since CVS has no notion of a “batch submit”, there's no reliable way to say, “What other changes were made in conjunction with this one?”
- Flaw #2: you can create new directories, but can't ever delete old ones
- Subversion
- Designed as a backward-compatible replacement for CVS starting in 2000
- Fixes both of the major flaws in CVS
- Many open source projects have switched, or are switching…
- …so we'll use it in this course
- See the
Subversion site for details
- Comment on this slide
Basic Use
- Assume for a moment that a repository has been created, and that Ron and Hermione already have working copies
- Ron wants to make changes to the rotor spin simulation
- Runs
svn update to bring his copy up to date with the repository - Edits
spin.c and spin.h - Runs
svn commit to save those changes in the repository- Note: the whole repository gets a new version number
- Ron realizes that he forgot to make one change
- Runs
svn update again, just in case Hermione has also been making changes (she hasn't) - Edits
spin.c again - Runs
svn commit a second time
- Several hours later, Hermione runs
svn update on her working copy- Subversion copies Ron's changes into her directory
![[Basic Use]](img/version/basic_use.png)
Figure 3.1: Basic Use |
- Comment on this slide
How To Do It
- One way to use Subversion is to type commands in a shell
- Guaranteed to work everywhere without anything else being installed
- But there are several good graphical interfaces for Subversion too
RapidSVN runs on Windows, Linux, and Mac- Well, maybe “walks” is a better description—as of Version 0.9, it's not the fastest thing in the world
![[RapidSVN]](img/version/rapidsvn.png)
Figure 3.2: RapidSVN |
TortoiseSVN is a Windows shell extension- Which means that it integrates with the Windows file browser, rather than running separately
- But that also means that it doesn't work on Linux or Mac
![[TortoiseSVN]](img/version/tortoisesvn.png)
Figure 3.3: TortoiseSVN |
- And if you're on a Macintosh, there's
SmartSVN - Comment on this slide
Working Together
- What if two (or more) people want to edit the same file at the same time?
- Option 1: prevent it
- Only allow one person to have a writeable copy of the file at once
- Pessimistic concurrency
- Microsoft Visual SourceSafe
- Option 2: patch up afterwards
![[Merging Conflicts]](img/version/conflict_merge.png)
Figure 3.4: Merging Conflicts |
- Ron and Hermione both have copies of
spin.c version 15int maxRotateSetting(int * available, int length)
{
int i, maxFound;
if (length == 0) {
return 0;
}
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Ron commits his changes
- Creates
spin.c version 16 // Find maximum rotation setting from those available,
// or 0 if none are available.
int maxRotateSetting(int * available, int length)
{
int i, maxFound;
if (length == 0) {
return 0;
}
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Meanwhile, Hermione is editing her copy, and produces this:
// Find maximum rotation setting, or 0.
int maxRotateSetting(int * available, int length)
{
int i, maxFound = 0;
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Tries to submit her changes: conflict!
- Subversion puts both Ron's and Hermione's changes in Hermione's working copy, with markers
<<<<<<< .mine
// Find maximum rotation setting, or 0.
=======
// Find maximum rotation setting from those available,
// or 0 if none are available.
>>>>>>> .r471
int maxRotateSetting(int * available, int length)
{
int i, maxFound = 0;
for (i=0; i<length; i+=1) {
if (available[i] > maxFound) {
maxFound = available[i];
}
}
return maxFound;
}
- Also creates
spin.c.mine, spin.c.15, and spin.c.16 for reference - Hermione must decide what to keep, and what to throw away
- Subversion won't let her commit his changes until all the conflict markers are eliminated
- Once Hermione is done editing, she:
- Runs
svn resolved spin.c to tell Subversion the conflict has been fixed - Runs
svn commit spin.c to create spin.c version 17
- Comment on this slide
What Versions Actually Mean
- The discussion above referred to “version 16 of
spin.c”, but in fact there is no such thing - Instead, there's version 16 (or 17, or 18…) of the repository
- Users are supposed to try to keep the files in the repository in a consistent state
- I.e., don't submit things that are half-done
- Since the next person to do an update would then be in the same half-done state you are
- Subversion therefore updates the version number on the whole repository every time a set of changes is submitted
- Each change set can affect any number of files (including adding or deleting files)
- The phrase “version 229” therefore uniquely identifies an entire set of files
- Unlike CVS and other systems, where version 319 of one file might correspond to version 107 of another, and version 794 of a third
- Comment on this slide
Warning: Binary Files
- Subversion can only mark conflicts this way in text files
- I.e., files that store lines of human-readable characters
- Source code, HTML—basically, anything you can edit with Notepad, Vi, or Emacs
- Images, video clips, Microsoft Word, and many other formats aren't
- When there's a conflict, Subversion saves your copy and the master copy side by side in your working directory
- Up to you to resolve the differences
- Comment on this slide
Rolling Back Changes
- Suppose Ron decides that he doesn't like his recent changes
svn diff will show him which files he has changed, and what those changes are- He hasn't committed anything yet, so he can use
svn revert to re-synchronize with the master copy - If you find yourself doing this repeatedly, you should probably go and do something else for a while…
- Now suppose that Ron decides he doesn't like the changes Hermione just made to
spin.c
- Wants to do the equivalent of “undo” on several files
svn log shows recent history- He decides he wants to revert to version 16 of the repository
svn merge -r 17:16 spin.c means “merge changes, going from version 17 to version 16” (i.e., backwards)
- Can obviously go back to even earlier versions to undo more changes
![[Undoing Changes]](img/version/merge_undo.png)
Figure 3.5: Undoing Changes |
- Comment on this slide
And Finally, Getting Started
- To create a repository:
- Decide where to put it (e.g.,
/rotor/repo) - Go into the containing directory:
cd /rotor svnadmin create repo
- Can then interact with repository in two ways
- Directly through the file system:
file:///rotor/repo
- Use this if you're working on the same machine the repository is on
- Through a web server:
https://your.host.name/rotor/repo
- Use this if the repository is on a remote machine
- Note: requires your system administrator to configure the web server properly
https (instead of http) means “use a secure connection”
- To get a working copy (assuming you're using a web server):
svn checkout https://your.host.name/rotor/repo- Creates a new directory
repo
- Common to give it a more informative name using
svn checkout https://your.host.name/rotor/repo rotorproject
- Important: only use
svn checkout once, to initialize your working copy
- Comment on this slide
Subversion Command Reference
| Name | Purpose | Example |
|---|
svn add | Add files and/or directories to version control. | svn add newfile.c newdir |
svn checkout | Get a fresh working copy of a repository. | svn checkout https://your.host.name/rotor/repo rotorproject |
svn commit | Send changes from working copy to repository (inverse of update). | svn commit -m "Comment on the changes" |
svn delete | Delete files and/or directories from version control. | svn delete oldfile.c |
svn help | Get help (in general, or for a particular command). | svn help update |
svn log | Show history of recent changes. | svn log --verbose *.c |
svn merge | Merge two different versions of a file into one. | svn merge -r 18:16 spin.c |
svn mkdir | Create a new directory and put it under version control. | svn mkdir newmodule |
svn rename | Rename a file or directory, keeping track of history. | svn rename temp.txt release_notes.txt |
svn revert | Undo changes to working copy (i.e., resynchronize with repository). | svn revert spin.h |
svn status | Show the status of files and directories in the working copy. | svn status |
svn update | Bring changes from repository into working copy (inverse of commit). | svn update |
Table 3.1: Common Subversion Commands
- Comment on this slide
How to Read Subversion Output
svn status compares your working copy with the repository, printing one line for each file that's worth talking about$ svn status
M spin.c
MC readme.txt
spin.c has been modifiedreadme.txt has been modified, and has conflicts
svn update prints one line for each file or directory it does something to$ svn update
A newspin.c
U spin.c
C spin.h
newspin.c has been addedspin.c has been updated (i.e., someone else modified it)- There's a conflict in
spin.h
- Which you'll have to resolve before you can commit your changes
- Comment on this slide
Branching and Merging
- Sometimes want to work on several different versions of software at once
- Example: need to do bug fixes on Version 3 while making incompatible changes toward Version 4
- Or want two sets of developers to be able to write and test large changes independently, then put things back together
- All modern version control systems allow you to branch a repository
- Create a “parallel universe” which is initially the same as the original, but which evolves independently
![[Branching and Merging]](img/version/branch_and_merge.png)
Figure 3.6: Branching and Merging |
- Much better than just copying all the source files: the version control system remembers where the branch came from, and can trace its history back
- Can later merge changes from one branch to another
- Example: fix a bug on one branch, merge the changes into other branches that have the same bug
- Again, much better than copying by hand, since the version control system can keep track of where things came from, and where they went
- Warning: many people become over-excited about branching when they first start to use it
- Keeping track of what's going on where can be a considerable management overhead
- On a small project, very rare to need more than two active branches
- Comment on this slide
Exercises
Exercise 3.1:
Follow the instructions given to you by your instructor to
check out a copy of the Subversion repository you'll be using in
this course. Unless otherwise noted, the exercises below
assume that you have done this, and that your working copy is in
a directory called course. You will submit all of your
exercises in this course by checking files into your
repository.
Exercise 3.2:
Create a file course/ex01/bio.txt (where
course is the root of your working copy of your
Subversion repository), and write a short biography of yourself
(100 words or so) of the kind used in academic journals,
conference proceedings, etc. Commit this file to your
repository. Remember to provide a meaningful comment when
committing the file!
Exercise 3.3:
What's the difference between mv and svn
mv? Put the answer in a file called
course/ex01/mv.txt and commit your changes.
Once you have committed your changes, type svn
log in your course directory. If you didn't know
what you'd just done, would you be able to figure it out from
the log messages? If not, why not?
Exercise 3.4:
In this exercise, you'll simulate the actions of two
people editing a single file. To do that, you'll need to check
out a second copy of your repository. One way to do this is to
use a separate computer (e.g., your laptop, your home computer,
or a machine in the lab). Another is to make a temporary
directory, and check out a second copy of your repository there.
Please make sure that the second copy isn't inside the first, or
vice versa—Subversion will become very confused.
Let's call the two working copies Blue and Green. Do the
following:
a) Create Blue/ex01/planets.txt, and add the
following lines:
Mercury
Venus
Earth
Mars
Jupiter
Saturn
Commit the file.
b) Update the Green repository. (You should get a copy of
planets.txt.)
c) Change Blue/ex01/planets.txt so that it reads:
1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
Commit the changes.
d) Edit Green/ex01/planets.txt so that its contents
are as shown below. Do not do svn update
before editing this file, as that will spoil the
exercise.
Mercury 0
Venus 0
Earth 1
Mars 2
Jupiter 16 (and counting)
Saturn 14 (and counting)
e) Now, in Green, do svn update. Subversion
should tell you that there are conflicts in planets.txt.
Resolve the conflicts so that the file contains:
1. Mercury 0
2. Venus 0
3. Earth 1
4. Mars 2
5. Jupiter 16
6. Saturn 14
Commit the changes.
f) Update the Blue repository, and check that
planets.txt now has the same content as it has in the
Green repository.
Exercise 3.5:
Add another line or two to course/ex01/bio.txt and
commit those changes. Then, use svn merge to restore
the original contents of your biography
(course/ex01/bio.txt), and commit the result. When you
are done, bio.txt should look the way it did at the end
of the first part of the previous exercise.) Note: the purpose
of this exercise is to teach you how to go back in time to get
old versions of files—while it would be simpler in this
case just to edit bio.txt, you can't (reliably) do that
when you've made larger changes, to multiple files, over a
longer period of time.
Shell Basics
Introduction
- Most modern tools have a graphical user interface (GUI)
- Because they're easier to use
- But command-line user interfaces (CLUIs) still have their place
- Easier (faster) to build new CLUI tools
- Building a GUI takes time
- Building a good GUI takes a lot of time
- Higher action-to-keystroke ratio
- Once you're over the (steeper) learning curve
- Easier to see and understand what the computer is doing on your behalf
- Which is part of what this course is about
- Most important: it's easier to combine CLUI tools than GUI tools
- Small tools, combined in many ways, can be very powerful
-
[Ray & Ray 2003]
is a good introduction for newcomers
- How to tell if you can skip this lecture
- Do you know what a shell is?
- Do you know the difference between an absolute path and a relative path?
- Do you know what a process is?
- Do you know what a pipe is?
- Do you know what
$PATH is? - Do you know what
rwxr-xr-x means?
- Comment on this slide
The Shell vs. the Operating System
- The most important command-line tool is the command shell (often just called “the shell”)
- Manages a user's interactions with the operating system by:
- Reading commands from the keyboard
- Figuring out what programs the user wants to run
- Running those programs
- Displaying their output on the screen
- Looks (and works) like an interactive terminal circa 1980
![[A Shell in Action]](img/shell01/shell_screenshot.png)
Figure 4.1: A Shell in Action |
- The shell is just one program among many
- Many different ones have been written
sh was the first for Unix- Most others extend its capabilities in various ways
- Which means that it's the lowest common denominator you can always rely on
- We'll use
bash (the Bourne again shell) in this course- Available just about everywhere
- Even on Windows (thanks to
Cygwin)
- In contrast, the operating system is not just another program
- Automatically loaded when the computer boots up
- The only program that can talk directly to the computer's hardware
- I.e., read characters from the keyboard, or send drawing commands to the screen
- Manages files and directories on the disk
- Keeps track of who you are, and what you're allowed to do
- You can run many instances of the shell on a computer at once, but it can only run one operating system at a time
![[Operating System and Shell]](img/shell01/os_shell.png)
Figure 4.2: Operating System and Shell |
- Comment on this slide
The File System
- The file system is the set of files and directories the computer can access
- “Everything that stays put when you turn the computer off and restart it”
- Data is stored in files
- By convention, files have two part names, like
notes.txt or home.html - Most operating systems allow you to associate a filename extension with an application
- E.g.,
.txt is associated with an editor, and .html with a web browser
- But this is all just convention: you can call files (almost) anything you want
- Files are stored in directories (often called folders)
- Directories can contain other directories, too
- Results in the familiar directory tree
![[A Directory Tree]](img/shell01/directory_tree.png)
Figure 4.3: A Directory Tree |
- Everything in a particular directory must have a unique name
- Otherwise, how would you identify it?
- But items in different directories can have the same name
- On Unix, the file system has a unique root directory called
/
- Every other directory is a child of it, or a child of a child, etc.
- On Windows, every drive has its own root directory
- So
C:\home\gvwilson\notes.txt is different from J:\home\gvwilson\notes.txt - When you're using Cygwin, you can also write
C:\home\gvwilson as c:/home/gvwilson - Or as
/cygdrive/c/home/gvwilson
- Some Unix programs give
":" a special meaning, so Cygwin needed a way to write paths without it…
- A path is a description of how to find something in a file system
- An absolute path describes a location from the root directory down
- Equivalent to a street address
- Always starts with
"/" - E.g.,
/home/gvwilson is my home directory, and /courses/swc/lec/shell.swc is this file
- A relative path describes how to find something from some other location
- Equivalent to saying, “Four blocks north, and seven east”
- E.g., from
/courses/swc, the relative path to this file is lec/shell.swc
- Every program (including the shell) has a current working directory
- “Where am I?”
- Relative paths are deciphered relative to this location
- It can change while a program is running
- Finally, two special names:
"." means “the current directory”".." means “the directory immediately above this one- Also called the parent directory
- In
/courses/swc/data, .. is /courses/swc - In
/courses/swc/data/elements, .. is /courses/swc/data ![[Parent Directories]](img/shell01/parent_directory.png)
Figure 4.4: Parent Directories |
- Comment on this slide
A Few Simple Commands
- Easiest way to learn basic Unix commands is to see them in action
- First, I type
pwd (short for "print working directory”) to find out where I am- Unfortunately, most Unix commands have equally cryptic names
- I then type
ls (for “listing”) to see what's in the current directoryls
LICENSE.txt admin data graphics lec pdf scraps util
Makefile cgi-bin etc img mp3 publ src web
- What actually happens when I type
ls is:
- The operating system reads characters from the keyboard
- Passes them to the shell (because it's the currently active window on my desktop)
- The shell breaks the line of text it receives into words
- Looks for a program with the same name as the first word (i.e., the command to run)
- Describe in a moment how the shell knows where to look
- Runs that program
- Reads the program's output and sends it back to the operating system for display
![[Running a Program]](img/shell01/shell_running_program.png)
Figure 4.5: Running a Program |
- I can tell
ls to produce more informative output by giving it some flags
- By convention, flags start with
"-", as in "-c" or "-l" - Show directories with trailing slash
ls -F
LICENSE.txt admin/ data/ exer/ lec/ publ/ soln/ util/
Makefile cgi-bin/ etc/ img/ pdf/ scraps/ src/ web/
- Show all files and directories, including those whose names begin with
.
ls -a
. .. .svn admin data
exer img lec publ soln
src util tmpl README.txt license.txt
todo.txt
- By default,
ls doesn't show anything whose name begins with . - Note the
.svn directory: this is where Subversion keeps administrative information - Do not edit anything in this directory yourself, or Subversion will become very confused
- Comment on this slide
Creating Files and Directories
- Rather than messing with the course files, let's create a temporary directory and play around in there
- Note: no output
- The
-v (“verbose”) flag tells mkdir to print a confirmation message
- Now go into that directory
- Changes the shell's notion of our current working directory
pwd
/home/gvwilson/swc/temp
- No files there yet:
- Use the editor of your choice to create a file called
earth.txt with the following contents:
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
- Notepad (on Windows) runs in a window of its own
- Pico (on Unix) takes over the shell window temporarily
- We'll look at more advanced editing tools for programming in a few lectures
- Easiest way to create a similar file
venus.txt is to copy the one we havels -t
venus.txt earth.txt
- Note: the
-t option tells ls to list newest first
- Check the contents of the file using
cat (short for “concatenate”)
- Just prints the contents of a file to the screen
cat venus.txt
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
- Edit the file so that it looks like this:
Name: Venus
Period: 224.70 days
Inclination: 3.39
Eccentricity: 0.01
- Compare the sizes of the two files using
wc (for “word count”)
wc earth.txt venus.txt
4 9 69 earth.txt
4 9 69 venus.txt
8 18 138 total
- Columns show lines, words, and characters
- Comment on this slide
Wildcards
- Some characters (called wildcards) mean special things to the shell
* matches zero or more characters- So
ls *.f77 lists all the Fortran-77 files in a directory
wc *.txt
4 9 69 earth.txt
4 9 69 venus.txt
8 18 138 total
? matches any single character- So
ls ??.txt lists all the text files with two-letter prefixes - And
ls ??.* lists all the files with two-letter prefixes, and any extension
~ on its own means “my home directory”
- I.e., the one I'm in when I first log in
~harry means “Harry's home directory”
- Note: the shell expands wildcards before running commands
- There's no way for
ls to know whether it was invoked as ls *.txt or rm earth.txt venus.txt
- Comment on this slide
Exercises
Exercise 4.1:
Suppose you are in your home directory, and ls shows
you this:
Makefile biography.txt data
enrolment.txt programs thesis
What argument(s) do you have to give to ls to get it
to put a trailing slash after the names of subdirectories, like
this:
Makefile biography.txt data/
enrolment.txt programs/ thesis/
If you run ls data, it shows:
earth.txt jupiter.txt mars.txt
mercury.txt saturn.txt venus.txt
What command should you run to get the following output:
data/earth.txt data/jupiter.txt data/mars.txt
data/mercury.txt data/saturn.txt data/venus.txt
What if you want this (note that an extra entry is being
displayed):
total 7
drwxr-xr-x 7 someone 0 May 6 08:27 .svn
-rw-r--r-- 1 someone 2396 May 6 08:38 earth.txt
-rw-r--r-- 1 someone 1263 May 6 08:38 jupiter.txt
-rw-r--r-- 1 someone 1015 May 6 08:43 mars.txt
-rw-r--r-- 1 someone 946 May 6 08:41 mercury.txt
-rw-r--r-- 1 someone 1714 May 6 08:40 saturn.txt
-rw-r--r-- 1 someone 881 May 6 08:40 venus.txt
Note: the command will display your user ID, rather than
someone. On some machines, the command will also display a
group ID. Ignore these differences for the purpose of this
question.
Exercise 4.2:
According to the listing of the data directory above, who
can read the file mercury.txt? Who can write it (i.e., change
its contents or delete it)? When was mercury.txt last
changed? What command would you run to allow everyone to edit or
delete the file?
Exercise 4.3:
Suppose you want to remove all files whose names (not including
their extensions) are of length 3, start with the letter a, and
have .txt as extension. What command would you use? For
example, if the directory contains three files a.txt,
abc.txt, and abcd.txt, the command should remove
abc.txt , but not the other two files.
Exercise 4.4:
What does the command cd ~ do? What about cd
~gvwilson?
Exercise 4.5:
What's the difference between the commands cd HOME
and cd $HOME?
Exercise 4.6:
Suppose you want to list the names of all the text files in the
data directory that contain the word "carpentry". What
command or commands could you use?
Exercise 4.7:
Suppose you have written a program called analyze. What
command or commands could you use to display the first ten lines of
its output? What would you use to display lines 50-100? To send
lines 50-100 to a file called tmp.txt?
Exercise 4.8:
The command ls data > tmp.txt writes a listing of
the data directory's contents into tmp.txt. Anything
that was in the file before the command was run is overwritten. What
command could you use to append the listing to tmp.txt
instead?
Exercise 4.9:
What command(s) would you use to find out how many
subdirectories there are in the lectures directory?
Exercise 4.10:
What does rm *.ch? What about rm
*.[ch]?
Exercise 4.11:
What command(s) could you use to find out how many instances of
a program are running on your computer at once? For example, if you
are on Windows, what would you do to find out how many instances of
svchost.exe are running? On Unix, what would you do to
find out how many instances of bash are running?
Exercise 4.12:
What do the commands pushd, popd,
and dirs do? Where do their names come from?
Exercise 4.13:
How would you send the file earth.txt to the
default printer? How would you check it made it (other than
wandering over to the printer and standing there)?
Exercise 4.14:
A colleague asks for your data files. How would you
archive them to send as one file? How could you compress them?
Exercise 4.15:
The instructor wants you to use a hitherto unknown command
for manipulating files. How would you get help on this command?
Exercise 4.16:
You have changed a text file on your home PC, and mailed
it to the university terminal. What steps can you take to see
what changes you may have made, compared with a master copy in
your home directory?
Exercise 4.17:
How would you change your password?
Exercise 4.18:
grep is one of the more useful tools in the
toolbox. It finds lines in files that match a pattern and
prints them out. For example, assume I have files
earth.txt and venus.txt containing lines like
this:
Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02
If I type grep Period *.txt in that
directory, I get:
earth.txt:Period: 365.26 days
venus.txt:Period: 224.70 days
Search strings can use regular expressions, which will be
discussed in a later lecture.
grep takes many options as well; for example,
grep -c /bin/bash /etc/passwd reports how many lines
in /etc/passwd (the Unix password file) that contain the
string /bin/bash, which in turn tells me how many users
are using bash as their shell.
Suppose all you wanted was a list of the files that
contained lines matching a pattern, rather than the matches
themselves—what flag or flags would you give to
grep? What if you wanted the line numbers of
matching lines?
Exercise 4.19:
diff finds and displays the differences between
two files. It works best if both files are plain text (i.e.,
not images or Excel spreadsheets). By default, it shows the
differences in groups, like this:
3c3,4
< Inclination: 0.00
---
> Inclination: 0.00 degrees
> Satellites: 1
(The rather cryptic header "3c3,4" means that line 3
of the first file must be changed to get lines 3-4 of the
second.)
What flag(s) should you give diff to tell it to
ignore changes that just insert or delete blank lines? What if
you want to ignore changes in case (i.e., treat lowercase and
uppercase letters as the same)?
Exercise 4.20:
Suppose you wanted ls to sort its output by
filename extension, i.e., to list all .cmd files before
all .exe files, and all .exe's before all
.txt files. What command or commands would you
use?
More Shell
Redirecting Input and Output
- A running program is called a process
- By default, every process has three connections to the outside world:
- You can tell the shell to connect standard input and standard output to files instead
command < inputFile reads from inputFile instead of from the keyboard- Don't need to use this very often, because most Unix commands let you specify the input file (or files) as command-line arguments
command > outputFile writes to outputFile instead of to the screen- Only “normal” output goes to the file, not error messages
command < inputFile > outputFile does both![[Redirecting Standard Input and Output]](img/shell02/redirecting_stdio.png)
Figure 5.2: Redirecting Standard Input and Output |
- Example: save number of lines in all text files to
words.len
$ wc -w *.txt > words.len
- Nothing appears, because output is being sent to the file
words.len $ ls -t
words.len venus.txt earth.txt
$ cat words.len
4 9 69 earth.txt
4 9 69 venus.txt
3 12 62 words.len
11 30 200 total
- Try typing
cat > junk.txt
- No input file specified, so
cat reads from the keyboard - Output sent to a file
- Voila: the world's dumbest text editor
- When you're done, use
rm junk.txt to get rid of the file- Don't type
rm * unless you're really, really sure that's what you want to do…
- Comment on this slide
Pipes
- Suppose you want to use the output of one program as the input of another
- E.g., use
wc -w *.* to count the words in some files, then sort -n to sort numerically
- Option 1: send output of first command to a temporary file, then read from that file
wc *.txt > temp
sort -n < temp
- Option 2: use a pipe to connect the two programs
- Written as
"|" - Tells the operating system to send what the first program writes to its stdout to the second program's stdin
wc -w *.* | sort -n
9 earth.txt
9 venus.txt
12 words.len
30 total
![[Pipes]](img/shell02/pipes.png)
Figure 5.3: Pipes |
- More convenient (and much less error prone) than temporary files
- Can chain any number of commands together
- And combine with input and output redirection
wc *.txt | sort -n | head -5 > shortest.files
- Any program that reads from standard input and writes to standard output can use redirection and pipes
- Programs that do this are often called filters
- If you make your programs work like filters, you can easily combine them with others
- A combinatorial explosion of goodness
- Comment on this slide
Environment Variables
- Like any other program, the shell has variables
- Since they define a user's environment, they are usually called environment variables
- Type
set at the command prompt to get a listing:
$ set
ANT_HOME=C:/apache-ant-1.6.2
BASH=/usr/bin/bash
COLUMNS=80
COMPUTERNAME=ISHAD
HISTFILESIZE=500
- Get a particular variable's value by putting a
"$" in front of its name- E.g., the shell replaces
"$HOME" with the current user's home directory - Often use the
echo command to print this out$ echo $HOME
/home/gvwilson
- Question: why must you type
echo $HOME, and not just $HOME?
- To set or reset a variable's value temporarily, use this:
- Only affects the current shell (and programs run from it)
- To set a variable's value automatically when you log in, set it in
~/.bashrc
- Remember,
"~" is a shortcut meaning “your home directory”
- For me, right now,
~/.bashrc is /home/gvwilson/.bashrc
- Important environment variables
| Name | Typical Value | Notes |
|---|
HOME | /home/gvwilson | The current user's home directory |
HOMEDRIVE | C: | The current user's home drive (Windows only) |
HOSTNAME | "ishad" | This computer's name |
HOSTTYPE | "i686" | What kind of computer this is |
OS | "Windows_NT" | What operating system is running |
PATH | "/home/gvwilson/bin:/usr/local/bin:/usr/bin:/bin:/Python24/" | Where to look for programs |
PWD | /home/gvwilson/swc/lec | Present working directory (sometimes CWD, for current working directory) |
SHELL | /bin/bash | What shell is being run |
TEMP | /tmp | Where to store temporary files |
USER | "gvwilson" | The current user's ID |
Table 5.1: Important Environment Variables
- Comment on this slide
How the Shell Finds Programs
- The most important of these variables is
PATH
- The search path that tells the shell where to look for programs
- When you type a command like
tabulate, the shell:
- Splits
$PATH on colons to get a list of directories - Looks for the program in each directory, in left-to-right order
- Runs the first one that it finds
- Example
PATH is /home/gvwilson/bin:/usr/local/bin:/usr/bin:/bin:/Python24- Both
/usr/local/bin/tabulate and /home/gvwilson/bin/tabulate exist /home/gvwilson/bin/tabulate will be run when you type tabulate at the command prompt- Can run the other one by specifying the path, instead of just the command name
- Warning: it is common to include
. in your path- This allows you to run a program in the current directory just by typing
whatever, instead of ./whatever - But it also means you can never be quite sure what program a command will invoke
- Though you can use the command
which program_name, which will tell you
- Common entries in
PATH include:
/bin, /usr/bin: core tools like ls
- Note: the word “bin” comes from “binary”, which is geekspeak for “a compiled program”
/usr/local/bin: optional (but common) tools, like the gcc$HOME/bin: tools you have built for yourself- Remember,
$HOME means “the user's home directory” - So this is equivalent to
~/bin
- Cygwin does things a little differently
- Uses the notation
/cygdrive/c/somewhere instead of Windows' c:/somewhere
- The colon in
c:/somewhere would clash with the colons in the PATH variable
- By default, Cygwin treats
c:/cygwin as the root of its file system- So
/home/aturing is a synonym for c:/cygwin/home/aturing - Yes, it can be confusing, but remember: we're trying to run one operating system's tools on top of another
- Comment on this slide
Basic Tools
man | Documentation for commands. |
cat | Concatenate and display text files. |
cd | Change working directory. |
chmod | Change file and directory permissions. |
clear | Clear the screen. |
cp | Copy files and directories. |
date | Display the current date and time. |
diff | Show differences between two text files. |
echo | Print arguments. |
env | Show environment variables. |
head | Display the first few lines of a file. |
ls | List files and directories. |
mkdir | Make directories. |
more | Page through a text file. |
mv | Move (rename) files and directories. |
od | Display the bytes in a file. |
passwd | Change your password. |
pwd | Print current working directory. |
rm | Remove files. |
rmdir | Remove directories. |
sort | Sort lines. |
tail | Display the last few lines of a file. |
uniq | Remove duplicate lines. |
wc | Count lines, words, and characters in a file. |
which | locate a command |
Table 5.2: Basic Command-Line Tools
- Comment on this slide
Ownership and Permission: Unix
- On Unix, every user belongs to one or more groups
- The
groups command will show you which ones you are in
- Every file is owned by a particular user and a particular group
- Owner can assign different read, write, and execute permissions to user, group, and others
- Read: can look at contents, but not modify them
- Write: can modify contents
- Execute: can run the file (e.g., it's a program)
ls -l will show all of this information- (Along with the file's size and a few other things)
- Permissions displayed as three
rwx triples - “Missing” permissions shown by
"-" - Example:
rw-rw-r-- means “user and group can read and write; everyone else can read; no one can execute”
- Change permissions using
chmod
- Example:
chmod u+x something.exe gives the user execute permission to something.exe - Example:
chmod o-r notes.txt takes away the world's read permission for notes.txt
- Permissions mean something a little different for directories
- Execute permission means you can “go into” a directory, but does not mean you can read its contents
- So if a directory called
tools has permission rwx--x--x (i.e., owner can do anything, but everyone else only has execute permission), then:
- If someone other than the owner does
ls tools, permission is denied - But if there's a useful program called
tools/findanswers, other users can still run it
- Comment on this slide
Ownership and Permission: Windows
- Of course, it all works differently on Windows
- Not better or worse, just differently
- Windows XP uses access control lists (ACLs)
- Instead of describing users as “file owner, group member, or something else”, ACLs let you specify exactly what any particular user, or set of users, can do to a file, directory, device, etc.
- Older versions of Windows (such as Windows 95 and Windows 2000) are fundamentally insecure, and shouldn't be used
- Cygwin does its best to make the Windows model look like Unix's
- If you trip over the differences, please consult a system administrator
- Comment on this slide
More Advanced Tools
du | Print the disk space used by files and directories. |
find | Find files that match a pattern. |
grep | Print lines matching a pattern. |
gunzip | Uncompress a file. |
gzip | Compress a file. |
lpr | Send a file to a printer. |
lprm | Remove a print job from a printer's queue. |
lpq | Check the status of a printer's queue. |
ps | Display running processes. |
tar | Archive files. |
which | Find the path to a program. |
who | See who is logged in. |
xargs | Execute a command for each line of input. |
Table 5.3: Advanced Command-Line Tools
- Comment on this slide
Exercises
Exercise 5.1:
You're worried your data files can be read by your
nemesis, Dr. Evil. How would you check whether or not he can,
and if necessary change permissions so only you can read or
write the files?
Basic Scripting
Why Python?
- Two factors determine time to solution:
- How long it takes to write a program (human time)
- How long it takes that program to run (machine time)
- Different languages make different tradeoffs between programming speed and execution speed
- Programmers write the same number of lines of code per day no matter what language they're using, so high-level languages are good
- But the more abstract the language, the more work it is for the computer to figure out what you want it to do
![[Nimble vs. Sturdy Languages]](img/py01/nimble_vs_sturdy.png)
Figure 6.1: Nimble vs. Sturdy Languages |
Python is:
- Like the shell, only better
- Freely available for many platforms
- Widely used
- Well documented
- (Much) easier to read than Perl
- Material that took three days to teach in Perl took only two to teach in Python
- Follow-up surveys showed significantly higher retention rates
- (Much) slower than C/C++ or Fortran
- 10-100 times slower than compiled, optimized code
- But it's relatively easy to call C/C++ and Fortran libraries from Python
- Doesn't have all of MATLAB's numerical tools
- But its
Numeric package isn't bad - And it has a lot of things MATLAB doesn't
- This course isn't really about Python
- It's about solving simple problems with the least effort
- But we have to write the examples in something
- So we might as well choose something simple and useful
- For more information:
- Comment on this slide
Running Python Interactively
- Running a program in a sturdy language is a two-step process:
- Compiler translates source code into something that can run
- That “something” then runs
- Directly on the hardware (C, C++, Fortran)
- On a virtual machine (Java)
- Some combination of the two (C#)
- Nimble languages typically combine the compiler and the virtual machine
- A single program reads the user's code, translates it into something executable, and executes it right away
![[Sturdy vs. Nimble Execution]](img/py01/sturdy_vs_nimble_execution.png)
Figure 6.2: Sturdy vs. Nimble Execution |
- This means that most nimble languages can run interactively, like a shell
$ python
>>> 3 + 5
8
>>> x = 5 * 3 ** 2 # assign 5 times 3 squared to x
>>> print x
45
>>> print 'some' + 'thing' # concatenate strings
something
- Comment on this slide
Running Saved Programs
- Obviously don't have to retype program every time you want to run it
- Option 1: save program in a file with a
.py extension, and type python filename.py
- Python reads and executes the commands in the file exactly as if they'd been typed in interactively
- Option 2 (Unix only): make the following the first line of the
.py file- This tells Unix to look up a program called
python, and run it with the rest of the file as its input
- Option 3 (Windows only): associate
.py files with Python- Double-clicking on anything ending in
.py will then run it - Happens automatically when you run the Python Windows installer
- Example
- Using an editor, put the following in
powers.py print 2, 2**2, 2**3, 2**4
print 3, 3**2, 3**3, 3**4
print 4, 4**2, 4**3, 4**5
- Run using
python powers.py - Should see the following
2 4 8 16
3 9 27 81
4 16 64 1024
- Comment on this slide
Variables
- Variables are names for values
- No declarations: variables are created when something is assigned to them
![[Variables Refer to Values]](img/py01/vars_refer_to_values.png)
Figure 6.3: Variables Refer to Values |
- No types: a variable is just a name, and can refer to different types of values at different times
- Although your code will be easier to understand if you don't abuse this
- Must give a variable a value before using it
- Unlike some languages, Python doesn't try to guess a “sensible” value
print something
Traceback (most recent call last):
File "undefined.py", line 1, in ?
print something
NameError: name 'something' is not defined
- While variables don't have types, values do
- If you try to operate on incompatible values, Python will complain
x = 'two' # 'two' is a string
y = 2 # 2 is an integer
print x * y # multiplying a string concatenates it repeatedly
print x + y # but you can't add an integer and a string
twotwo
Traceback (most recent call last):
File "add_int_str.py", line 4, in ?
print x + y # but you can't add an integer and a string
TypeError: cannot concatenate 'str' and 'int' objects
- Comment on this slide
Printing and Quoting
- The
print statement prints zero or more values- Separated by spaces
- Automatically puts a newline at the end
- So
print on its own just prints a blank line
- Use either single or double quotes to create strings
- Must be consistent: if a string starts with one kind of quote, it must end with that kind of quote
- But different strings in the same program can use different kinds of quotes
- This was not necessarily a good design decision…
print 'a', "b", '"c"', "'d'"
a b "c" 'd'
- The built-in function
str converts things to stringsprint 'carbon-' + str(14)
carbon-14
- Use similar functions called
int, float, etc. to convert values to other types
- Use escape sequences to put special characters in strings
- Single quotes in a single-quoted string:
'This isn\'t unusual.' - Double quotes in a double-quoted string:
"She said, \"You can quote me on that!\"" - Tab and newline characters:
'\tIndented line\n'
- Comment on this slide
Numbers and Arithmetic
- The usual numeric types
14 is an integer (32 bits long on most machines)14.0 is floating point (double precision, i.e., 64 bits long)
- Two unusual numeric types
1+4j is a complex number (two 64-bit values)123456789L is a long integer- Arbitrary length: uses as much memory as it needs to
- Operations are several times slower
- Python automatically switches to long-integer mode when it needs to
- Python borrows C's numeric operators
| Name | Operator | Use | Value | Notes |
|---|
| Addition | + | 35 + 22 | 57 | |
| | 'Py' + 'thon' | 'Python' | |
| Subtraction | - | 35 - 22 | 13 | |
| Multiplication | * | 3 * 2 | 6 | |
| | 'Py' * 2 | 'PyPy' | 2 * 'Py' is illegal |
| Division | / | 3.0 / 2 | 1.5 | |
| | 3 / 2 | 1 | Integer division rounds down: -3 / 2 is -2, not -1 |
| Exponentiation | ** | 2 ** 0.5 | 1.4142135623730951 | |
| Remainder | % | 13 % 5 | 3 | |
Table 6.1: Numeric Operators in Python
- Python also supports C's in-place operators
x += 3 does the same thing as x = x + 35 += 3 is an error, since you can't assign a new value to 5…
- Comment on this slide
Booleans
True and False are true and false (d'oh)- Empty string, 0, and
None are equivalent to false- Just as 3 is equivalent to 3.0
- (Almost) everything else is true
- Combine Booleans using
and, or, not
and and or are short-circuit operators- Evaluate expressions left to right, and stop as soon as they know the answer
- Result is the last thing evaluated, rather than
True or False | Expression | Result | Notes |
|---|
True or False | True | |
True and False | False | |
'a' or 'b' | 'a' | or is true if either side is true, so it stops after evaluating 'a' |
'a' and 'b' | 'b' | and is only true if both sides are true, so it doesn't stop until it has evaluated 'b' |
0 or 'b' | 'b' | 0 is false, but 'b' is true |
0 and 'b' | 0 | Since 0 is false, Python can stop evaluating there |
0 and (1/0) | 0 | 1/0 would be an error, but Python never gets that far |
(x and 'set') or 'not set' | It depends | If x is true, this expression's value is 'set'; if x is false, it is 'not set' |
Table 6.2: Boolean Operators in Python
- Comment on this slide
Comparisons
- Python borrows C's comparison operators, too
- But allows you to chain comparisons together, just as in mathematics
| Expression | Value |
|---|
3 < 5 | True |
3.0 < 5 | True |
3 != 5 | True |
3 == 5 | False |
3 < 5 <= 7 | True |
3 < 5 >= 2 | True (but please don't write this—it's hard to read) |
3+2j < 5 | Error: can only use == and != on complex numbers |
Table 6.3: Comparison Operators in Python
- Note the difference between assignment and testing for equality
- Use a single equals sign
= for assignment - Use a double equals sign
== to test if two things have equal values
- String comparison may not do what you expect
- Characters are encoded as numbers: digits come before uppercase letters, all of which come before lowercase letters
- Punctuation is mixed in between, just to make matters difficult
- Strings are compared character by character from first to last until:
- One character is less than another
- One string runs out of characters
| Expression | Value |
|---|
'abc' < 'def' | True |
'abc' < 'Abc' | False |
'ab' < 'abc' | True |
'0' < '9' | True |
'100' < '2' | True |
Table 6.4: Python String Comparisons
- Comment on this slide
Conditionals
- Python uses
if, elif (not else if), and else - Use a colon and indentation to show nesting
a = 3
if a < 0:
print 'less'
elif a == 0:
print 'equal'
else:
print 'greater'
- Why indentation?
- Based on studies from the 1980s, it's what everyone looks at anyway
- Just count the number of warnings in C/Java books about misleading indentation
- Doesn't matter how much you use, but:
- Everything in the block must be indented the same amount
- Tabs are expanded so that they're equivalent to (up to) eight characters
- So don't ever indent with tabs, since your editor may interpret them differently
- Many editors understand Python indentation, and will help you get it right
- If you're fond of legacy character-oriented editors,
Emacs and Vim both work well - If you want a free integrated development environment (IDE), try
SPE or PyDev - If you'd prefer something with commercial support,
Komodo and WingIDE are my favorites
- Comment on this slide
While Loops, Break, and Continue
- Do something repeatedly as long as some condition is true
- Again, use colon and indentation to show nesting
a = 3
while a > 0:
print a
a -= 1
3
2
1
- Do the “something” zero times if the condition is false the first time it is tested
print 'before'
a = -1
while a > 0:
print a
a -=1
print 'after'
before
after
- If the condition is always true, the loop never ends
a = 3
while a > 0:
print a
# oops --- forgot to subtract one from a
3
3
3
- Can break out of the middle of a loop using
break
a = 30
while True:
print a
a -= 1
if (a % 5) == 0: # if a is divisible by 5
break
30
29
28
27
26
- Don't abuse this: put the test at the top of the loop unless there's a really good reason not to
- Can skip to the next iteration using
continue
a = 5
while a > 0:
print 'top of loop', a
a -= 1
if (a % 2) == 0:
continue
print '...bottom of loop', a
top of loop 5
top of loop 4
...bottom of loop 3
top of loop 3
top of loop 2
...bottom of loop 1
top of loop 1
- Again, don't abuse this by writing loops that are hard for other people to figure out
- If only because “other people” includes you three months from now
- Comment on this slide
Strings, Lists, and Files
Where We Just Were
- Python is a nimble language with:
- Numbers, strings, and Booleans
- Conditionals (
if, elif, and else) while loops (with break and continue)
- This lecture describes:
- Lists, for storing collections of values
- Functions, which reduce redundancy, and make programs easier to read
- Reading and writing files
- Comment on this slide
But First, Strings
- A string is an immutable sequence of characters
- Immutable means that it cannot be modified once it has been created
- I.e., you cannot change individual characters in place
str = 'abc'
print 'str is', str
str[0] = 'x'
print 'str is now', str
str is abc
Traceback (most recent call last):
File "immutable_err.py", line 3, in ?
str[0] = 'x'
TypeError: object does not support item assignment
- Though you can of course assign a new string value to a variable
str = 'abc'
print 'str is', str
str = 'xyz'
print 'str is now', str
str is abc
str is now xyz
- Sequence means that it can be indexed
- Indices start at 0 (as they do in C)…
- …so
text[0] is the first character of the string text - The built-in function
len returns the length of a string… - …so the index of the last character of
text is len(text)-1
- Note: there is no separate data type for characters
- A character is simply a string of length 1
element = "boron"
i = 0
while i < len(element):
print element[i]
i += 1
b
o
r
o
n
- Comment on this slide
Slicing, Bounds, and Negative Indices
text[start:end] takes a slice out of text
- Creates a new string containing the characters of
text from start up to (but not including) end val = "helium"
print val[1:3], val[:2], val[4:]
el he um
- Sometimes helps to think of indices as being between elements
![[Visualizing Indices]](img/py02/index_between.png)
Figure 7.1: Visualizing Indices |
- A few logical consequences:
text[1:2] is either the second character in text, or the empty string (if text doesn't have a second character)- So
text[1:1] is always the empty string- From index 1, up to but not including index 1
- And
text[2:1] is always the empty string
- Negative indices count backward from the end of the string
x[-1] is the last character- A lot easier to read than
x[len(x)-1]
x[-2] is the second-to-last character![[Visualizing Negative Indices]](img/py02/index_between_negative.png)
Figure 7.2: Visualizing Negative Indices |
val = "carbon"
print val[-2], val[-4], val[-6]
o r c
- Bounds checking rules:
- Python always does an out-of-bounds check when you index a single item
- But it truncates out-of-range indices when you take a slice
val = "helium"
print val[1:22]
x = val[22]
elium
Traceback (most recent call last):
File "bounds.py", line 3, in ?
x = val[22]
IndexError: string index out of range
- Comment on this slide
String Methods
- A method is a function that's tied to a particular object
- Invented to help programmers organize their code
- To call method
M of object X, type X.M() - We'll see how to create objects and methods of our own later
- Almost everything in Python has methods
- Numbers are the only important exception
- String methods
| Method | Purpose | Example | Result |
|---|
capitalize | Capitalize first letter of string | "text".capitalize() | "Text" |
lower | Convert all letters to lowercase. | "aBcD".lower() | "abcd" |
upper | Convert all letters to uppercase. | "aBcD".upper() | "ABCD" |
strip | Remove leading and trailing whitespace (blanks, tabs, newlines, etc.) | " a b ".strip() | "a b" |
lstrip | Remove whitespace at left (leading) edge of string. | " a b ".lstrip() | "a b " |
rstrip | Remove whitespace at right (trailing) edge of string. | " a b ".rstrip() | " a b" |
count | Count how many times one string appears in another. | "abracadabra".count("ra") | 2 |
find | Return the index of the first occurrence of one string in another, or -1. | "abracadabra".find("ra") | 2 |
| | "abracadabra".find("xyz") | -1 |
replace | Replace occurrences of one string with another. | "abracadabra".replace("ra", "-") | "ab-cadab-" |
Table 7.1: String Methods
- Notes
- These methods don't have to be called on constant strings
text = "Hogwarts"
print text.upper()
print text.replace("rts", "sh")
HOGWARTS
Hogwash
- These methods create new strings: they do not change the strings they're called on
text = "Hogwarts"
print "method result is", text.replace("rts", "sh")
print "variable text is", text
method result is Hogwash
variable text is Hogwarts
- You can chain method calls together, and mix with indexing, concatenation, etc.
- If the result of one method call is a string, you can immediately call a method on that result
text = "Hermione"
print ':' + text.upper()[4:7].center(10) + ':'
: ION :
- Use this in moderation: long chains of method calls are hard to read and debug
- You can use
is to check whether one string appears in anotherprint "atta" in "gatcattcgat"
print "tcga" in "gatcattcgat"
False
True
- Comment on this slide
Lists
- A list is a mutable sequence of objects
- Mutable means that, unlike a string, it can be changed in place
- Of objects means that lists can hold anything and everything
- Think of it as a one-dimensional array, or vector, that automatically resizes itself as needed
- Write lists using square brackets:
[], [3], [5, "b"]
- The empty list is considered false
- Indexing and slicing work almost the same way for strings and lists:
- Indexing or slicing a string always returns a string
- Indexing a list returns a single item
- Slicing a list returns a new list
x = ["a", 2, "bcd", 4]
print x[0]
print x[-1]
print x[1:-1]
a
4
[2, 'bcd']
- As before, Python always checks bounds on element access, but truncates bounds on slices
- Assign a new value to a list element using
x[i] = v
gases = ["helium", "argon", "neon"]
print gases
i = 0
while i < len(gases):
gases[i] = "changed"
print gases
i += 1
['helium', 'argon', 'neon']
['changed', 'argon', 'neon']
['changed', 'changed', 'neon']
['changed', 'changed', 'changed']
- Adding two (or more) lists concatenates their elements
- Creates a new list
left = ['L', 'E', 'F', 'T']
right = ['W', 'R', 'I', 'T', 'E']
both = left + right
print both
['L', 'E', 'F', 'T', 'W', 'R', 'I', 'T', 'E']
- Can't concatenate a string and a list
- But
list(text) creates a list whose elements are the characters of the string text
print list("Dumbledore")
['D', 'u', 'm', 'b', 'l', 'e', 'd', 'o', 'r', 'e']
del deletes something from the list, and shortens the list- The “something” can be a slice, in which case several items are deleted
values = ['a', 'b', 'c', 'd', 'e']
print 'original', values
del values[2]
print 'after deleting item 2', values
del values[-2:]
print 'after deleting last two remaining items', values
original ['a', 'b', 'c', 'd', 'e']
after deleting item 2 ['a', 'b', 'd', 'e']
after deleting last two remaining items ['a', 'b']
- Note:
del modifies the list, and doesn't return anything
- Comment on this slide
List Methods
- Like strings, lists are objects, and have methods
- In the examples below, assume
x is initially ['c', 'a', 'b', 'c'] | Method | Purpose | Example | Result |
|---|
append | Add to the end of the list. | x.append(99) | ['c', 'a', 'b', 'c', 99] |
count | Count how many times something appears in the list. | x.count('c') | 2 |
index | Find the first occurrence of something in the list. | x.index('a') | 1 |
insert | Insert something into the list. | x.insert(2, 'diamond') | ['c', 'a', 'diamond', 'b', 'c'] |
remove | Remove the first occurrence of something from the list. | x.remove('c') | ['a', 'b', 'c'] |
reverse | Reverse the list in place. | x.reverse() | ['c', 'b', 'a', 'c'] |
sort | Sort the list in place. | x.sort() | ['a', 'b', 'c', 'c'] |
Table 7.2: List Methods
- Notes
- It's an error if
list.index can't find the item it's searching for- The next lecture describes how to handle errors
list.reverse and list.sort change the list, and return None
x = x.reverse() is a common error- It reverses
x, but then sets x to None, so all data is lost
- Comment on this slide
For Loops and Ranges
- Python's
for statement automatically loops over the content of a collectionfor x in text sets x to each character of text in turnfor x in someList sets x to each element of someList in turnfor c in 'lead':
print '+' + c + '+',
print
for x in ['helium', 'argon', 'neon']:
print x.capitalize()
+l+ +e+ +a+ +d+
Helium
Argon
Neon
- The trailing comma in the
print statement suppresses the automatic newline
- The built-in function
range creates the list [start, start+1, ..., end-1]
- Note:
end-1, just as x[start:end] is the elements of x up to, but not including, end range(end) is the same as range(0, end)range(start, end, step) goes in increments of step- May generate an empty list
print range(2, 5)
print range(3)
print range(2, 10, 3)
print range(3, 1)
[2, 3, 4]
[0, 1, 2]
[2, 5, 8]
[]
- So:
- To loop from 0 to N-1, use
for i in range(N) - To loop over the indices of a list or string, use
for i in range(len(sequence)) chars = "abc"
for i in range(len(chars)):
print i, chars[i]
0 a
1 b
2 c
- Comment on this slide
Membership
x in c is True if the value x is in the collection c
- Equivalent to a substring test on strings
'x' in 'Harry' is False'ar' in 'Harry' is True
- Works element-by-element on lists
3 in [1, 2, 3, 4] is True[2, 3] in [1, 2, 3, 4] is False
vowels = 'aeiou'
for v in vowels:
if v in 'uranium':
print v
a
i
u
- Comment on this slide
Nesting Lists
- Since lists can contain anything, you can build lists of lists of lists of…
- Example: use a list of two values to represent a point, and a list of points to represent a line segment
![[Line Segment]](img/py02/line_segment.png)
Figure 7.3: Line Segment |
- Write indices left to right to select elements from the outside in
x = [[13, 17, 19], [23, 29]]
print x[1]
print x[0][1:3]
[23, 29]
[17, 19]
- The nested lists are objects in their own right
- The outer list stores a reference to the inner list
- But the inner list does not know that it's being referred to
- Object in Python (almost) never know who's pointing at them
- When you index the outer list, you get an alias for the inner list
- Another name for the same data
- Changes made through either reference update the same data
x = [['a', 'b'], ['c', 'd', 'e']]
y = x[0]
print 'before: nested list is', x
print '...and sublist is', y
y[0] = 123
print 'after: nested list is', x
print '...and sublist is', y
before: nested list is [['a', 'b'], ['c', 'd', 'e']]
...and sublist is ['a', 'b']
after: nested list is [[123, 'b'], ['c', 'd', 'e']]
...and sublist is [123, 'b']
![[Aliasing In Action]](img/py02/aliasing.png)
Figure 7.4: Aliasing In Action |
- I lied to you earlier when I said that indexing and slicing does the same thing on lists that it does on strings
- The technical term is “pedagogically motivated misdirection”
- Indexing returns whatever the list itself contains
- In the case of a nested list, this is a reference to the sublist
- Slicing returns a new list containing the selected elements of the original list
- So changes to a slice do not affect the original list
- Can't tell the difference with strings, since there's no way to update one in place anyway
x = [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
y = x[0:2]
print 'before: nested list is', x
print '...and slice is', y
y[0] = 123
print 'after: nested list is', x
print '...and modified slice is', y
before: nested list is [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
...and slice is [['a', 'b'], ['c', 'd', 'e']]
after: nested list is [['a', 'b'], ['c', 'd', 'e'], ['f', 'g']]
...and modified slice is [123, ['c', 'd', 'e']]
![[Slicing Copies Data]](img/py02/slice_copy.png)
Figure 7.5: Slicing Copies Data |
- Yes, it can be confusing…
- …but anything else would be unacceptably slow…
- …so draw diagrams when you need to…
- …and don't write overly-complicated code in the first place
- Comment on this slide
Tuples
- Python actually has a second type of list, called a tuple
- Just like a normal list, but immutable (i.e., can't be changed after creation)
- Written using parentheses instead of square brackets:
(1, 2, 3) instead of [1, 2, 3] - Empty tuple is
() - Tuple with one element must be written with a comma, as in
(55,)
- Because
(55) has to be just the integer 55, or the mathematicians will get upset
- Why bother?
- Performance: if you know the length of the sequence isn't going to change, you can do some things more quickly
- There are places where Python needs immutable objects that contain other values
- We'll meet them in the next lecture
- I still think it's a wart…
- Tuples allow Python to provide multi-valued assignment
a, b = 2, 3 assigns 2 to a, and 3 to b
- Equivalent to
a, b = (2, 3)
a, b, c = [1, 2, 3] does what you'd expecta, b = [1, 2, 3, 4] is an error- Number of targets on the left must match the number of values on the right
a, b = b, a swaps the values in a and b
- Use multi-valued assignment in
for loops to unpack structures on the flycorners = [[1, 3], [2, 7], [4, 0], [4, 9]]
sumX, sumY = 0, 0
for cX, cY in corners:
sumX += cX
sumY += cY
sumX = float(sumX) / len(corners)
sumY = float(sumY) / len(corners)
print "center is", sumX, sumY
center is 2.75 4.75
- Comment on this slide
Files
- Use the built-in function
open() to open a file- First argument is path
- Second is
'r' (for read) or 'w' for write - Result is a file object with methods for input and output
infile = open('file.txt', 'r')
outfile = open('copy.txt', 'w')
line = infile.readline()
while line:
outfile.write(line)
line = infile.readline()
infile.close()
outfile.close()
- Here's what just happened:
- First statement opens
file.txt for reading, and assigns the file object to input - Second statement opens
copy.txt for writing, and assigns the file object to output - Program then tries to read a line from
input
- If the file is empty,
line is assigned the value None - Like
Null in C, or null in Java, it means “nothing here”, and is considered false
- As long as there are lines in the input file, the program:
- Writes the most recent line to
output - Reads the next line
- When the input file is exhausted, the program closes
input and output
- Python will close the files automatically when the program exits…
- …but it's good practice to tidy up your toys when your done playing with them
- Comment on this slide
Other Ways to Do It
- Could also copy files like this:
infile = open('file.txt', 'r')
outfile = open('copy.txt', 'w')
for line in infile:
outfile.write(line)
infile.close()
outfile.close()
- Or this:
infile = open('file.txt', 'r')
contents = infile.readlines()
infile.close()
outfile = open('copy.txt', 'w')
outfile.writelines(contents)
outfile.close()
- Moral of the story: get to know the standard library of whatever language you're using
- A week of hard work can sometimes save you an hour of reading, too…
- Comment on this slide
Exercises
Exercise 7.1:
What does "aaaaa".count("aaa") return? Why?
Exercise 7.2:
What does the built-in function enumerate do? Use it to
write a function called findOver that takes a list of numbers
called values, and a number called threshold, as
arguments, and returns a list of the locations where items in
values are greater than threshold. For example,
findOver([1.1, 3.8, -1.6, 7.4], 2.0) should return [1, 3],
since the values in the input list at locations 1 and 3 are
greater than the threshold 2.0.
Exercise 7.3:
What do each of the following five code fragments do? Why?
x = ['a', 'b', 'c', 'd']
x[0:2] = []
|
x = ['a', 'b', 'c', 'd']
x[0:2] = ['q']
|
x = ['a', 'b', 'c', 'd']
x[0:2] = 'q'
|
x = ['a', 'b', 'c', 'd']
x[0:2] = 99
|
x = ['a', 'b', 'c', 'd']
x[0:2] = [99]
|
Exercise 7.4:
What does 'a'.join(['b', 'c', 'd']) return? If you have
a list of strings, how can you concatenate them in a single statement?
Why do you think join is written this way, rather than as
['b', 'c', 'd'].join('a')?
Functions, Libraries, and the File System
Where We Just Were
- Python has:
- The usual primitive data types (numbers, strings, Booleans)
- The usual ways to combine statements (loops and conditionals)
- Lists for storing collections of data
- This lecture describes:
- Functions, for reusing code, and making it easier to understand
- Modules, for building libraries
- Four commonly-used libraries
- Comment on this slide
Defining Functions
- Define a new function using
def
- Argument names follow in parentheses
# Define function.
def ave(x, y):
return (x + y) / 2.0
# Use function.
print ave(20, 30)
25.0
- Exit at any time with
return
def sign(x):
if x < 0:
return -1
if x == 0:
return 0
return 1
for i in [-17.0, 33.3, 0.0]:
print i, sign(i)
-17.0 -1
33.3 1
0.0 0
- Functions with lots of
return statements tend to be hard to understand
- Every function returns something
return on its own is the same as return None- Functions without
return statements also return None
list.sort and list.reverse are examples
def double_elt(x):
for i in range(len(x)):
x[i] = x[i] * 2
values = [3, 'xyz', [9]]
print "values before call:", values
result = double_elt(values)
print "result of call:", result
print "values after call:", values
values before call: [3, 'xyz', [9]]
result of call: None
values after call: [6, 'xyzxyz', [9, 9]]
- Comment on this slide
Scope
- Variables created in function are local to it
- Fresh copies created for each call
- Values discarded when the function returns
- As always, variables must be defined before they can be used
- Python manages variables using a call stack
# Global variable called 'x'.
x = 123
# Function that defines a local variable called 'x'.
def f(arg):
x = arg
print "in call, x is", x
# Call the function to prove that it uses its local 'x'.
print "before call, global x is", x
f(999)
print "after call, global x is", x
before call, global x is 123
in call, x is 999
after call, global x is 123
![[Call Stack (a)]](img/py03/call_stack_a.png)
Figure 8.1: Call Stack (a) |
![[Call Stack (b)]](img/py03/call_stack_b.png)
Figure 8.2: Call Stack (b) |
![[Call Stack (c)]](img/py03/call_stack_c.png)
Figure 8.3: Call Stack (c) |
- Each time a function is called, Python creates new variables, and puts them on top of the stack
- When a variable is referenced, Python looks at the top stack frame, then at global variables
- Unlike some languages, does not search every frame in the stack
- Comment on this slide
Parameter Passing Rules
- Python copies variables' values when passing them to functions
- Since Booleans, numbers, and strings can't be updated, a function can't affect its caller's values
- But if you pass a list to a function, the function will operate on the original list, not a copy
def mutate(x, y):
x = 0
y[0] = "modified"
single = 1
triple = [1, 2, 3]
print "before call, single is", single, "and triple is", triple
mutate(single, triple)
print "after call, single is", single, "and triple is", triple
before call, single is 1 and triple is [1, 2, 3]
after call, single is 1 and triple is ['modified', 2, 3]
![[Argument Passing (a)]](img/py03/arg_ref_a.png)
Figure 8.4: Argument Passing (a) |
![[Argument Passing (b)]](img/py03/arg_ref_b.png)
Figure 8.5: Argument Passing (b) |
![[Argument Passing (c)]](img/py03/arg_ref_c.png)
Figure 8.6: Argument Passing (c) |
![[Argument Passing (d)]](img/py03/arg_ref_d.png)
Figure 8.7: Argument Passing (d) |
- If you want to pass a copy of a list into
mutate, use mutate(single, triple[:])
triple[:] is the same as triple[0:len(triple)]…- …which is a slice of
triple that includes the entire list… - …and slicing creates a new list
![[Passing a Slice]](img/py03/pass_slice.png)
Figure 8.8: Passing a Slice |
- Comment on this slide
Default Parameter Values
- You can specify default values for parameters when defining a function
- Just “assign” some constant to the parameter in the definition
- The parameters actually passed when the function is called are matched up left to right
- So put the parameters the user is least likely to want to change at the end of the parameter list
def total(values, start=0, end=None):
# If no values given, total is zero.
if not values:
return 0
# If no end specified, use the entire sequence.
if end is None:
end = len(values)
# Calculate.
result = 0
for i in range(start, end):
result += values[i]
return result
numbers being added [10, 20, 30, 40, 50, 60]
total(numbers, 0, 3) 60
total(numbers, 4) 110
total(numbers) 210
- All arguments with defaults must come after all arguments without them
- Otherwise, matching values to parameters would be ambiguous
def total(values, start=0, end):
# If no values given, total is zero.
if not values:
return 0
# Calculate.
result = 0
for i in range(start, end):
result += values[i]
return result
- Comment on this slide
Extra Arguments
- If a function has been defined the right way, you can call it with extra arguments
- Last parameter in definition must have
* in front of its name- Doesn't matter what it's called, but most people use
*extra
- Any “leftover” parameters are put in a tuple and passed via this argument
- Remember, a tuple is an immutable (unmodifiable) list
- Can only be one
* argument per function- If there were two or more, how would Python know which extra values to put in which?
def show(first, *extra):
print first, extra
show(10)
show(10, 20, 30)
show(10, 'text', ['a', 'list'])
10 ()
10 (20, 30)
10 ('text', ['a', 'list'])
- Comment on this slide
Functions Are Objects
- Once it has been translated into instructions, a function is just another object
- Happens to be an object you can call, just as strings and lists happens to be objects you can index
def is just a shorthand for “create a function, and assign it to a variable”Pi = 3.14
def circum(r):
return 2 * Pi * r
for x in [1.0, 2.0, 3.0]:
print x, circum(x)
![[Functions As Objects]](img/py03/function_object.png)
Figure 8.9: Functions As Objects |
- This means you can:
- Redefine functions (just as you can reassign values to variables)
- Create aliases for functions
- Pass functions as parameters
- Store functions in lists
- Example: apply a function to each value in a list
# Apply a function to each value in a list.
def applyToList(func, values):
result = []
for x in values:
y = func(x)
result.append(y)
return result
# A sample function (calculates 1/3 of value).
def oneThird(a):
return a/3.0
# Test.
values = [0.0, 0.5, 0.75, 0.875, 0.9375]
output = applyToList(oneThird, values)
for i in range(len(values)):
print values[i], "=>", output[i]
0.0 => 0.0
0.5 => 0.166666666667
0.75 => 0.25
0.875 => 0.291666666667
0.9375 => 0.3125
- Example: apply several functions to a single value
# Apply each function in a list to a value.
def applyEach(functions, value):
result = []
for f in functions:
y = f(value)
result.append(y)
return result
# One half.
def oneHalf(a):
return a/2.0
# One third.
def oneThird(a):
return a/3.0
# One quarter.
def oneQuarter(a):
return a/4.0
# Test.
functions = [oneHalf, oneThird, oneQuarter]
output = applyEach(functions, 0.25)
for i in range(len(functions)):
print functions[i].__name__, output[i]
oneHalf 0.125
oneThird 0.0833333333333
oneQuarter 0.0625
- Note: every function has an attribute called
__name__, which is the name it was originally defined under # An example function.
def original(left, right):
return (left + right) / 2.0
# Create an alias for it.
alias = original
# Call them directly.
print "calling original:", original.__name__, original(3, 4)
print "calling alias:", alias.__name__, alias(3, 4)
# Get a little fancy.
print "Calling in loop..."
for f in [original, alias]:
print f.__name__, f(5, 8)
calling original: original 3.5
calling alias: original 3.5
Calling in loop...
original 6.5
original 6.5
- Comment on this slide
Creating Modules
- Every Python file is automatically also a module (or library)
- If the file is called
xyz.py, load it using import xyz
- The statements in the module are executed as it is loaded
- Assignment and
def are statements - You can use conditionals, loops, and anything else, too
- Refer to things in a module using
module.thing
- Put this in
mylib.py # A constant.
value = 123
# Remember, modules are executed as they're loaded...
if value < 200:
size = 'small'
else:
size = 'large'
# A function.
def printVersion():
print 'Stuff Version 2.2'
- Use it by putting this in
program.py # Load the contents of mylib.py.
import mylib
# Define our own version of value.
value = "something else entirely"
# Show mylib's values.
print 'mylib.value', mylib.value
print 'mylib.size', mylib.size
# Show out own values.
print 'own value:', value
# Call function.
mylib.printVersion()
- When
program.py runs, it prints this # Load the contents of mylib.py.
import mylib
# Define our own version of value.
value = "something else entirely"
# Show mylib's values.
print 'mylib.value', mylib.value
print 'mylib.size', mylib.size
# Show out own values.
print 'own value:', value
# Call function.
mylib.printVersion()
mylib.value 123
mylib.size small
own value: something else entirely
Stuff Version 2.2
- Notice that both
mylib.py and program.py define value
- Each module is its own global scope
- When something in
mylib.py references value, it always gets its own value
- You can also use:
import mylib as m, then call m.printVersion()from mylib import printVersion, then call printVersion()
- Question: what would happen if you did
from mylib import value?
from mylib import * imports everything from mylib
- Comment on this slide
The Math Library
- Python is a small language with a large library
- Much of the standard library is just Python interfaces to standard C libraries
- Example: the
math library| Type | Name | Purpose | Example | Result |
|---|
| Constant | e | Constant | e | 2.71828… |
| pi | Constant | pi | 3.14159… |
| Function | ceil | Ceiling | ceil(2.5) | 3.0 |
| floor | Floor | floor(-2.5) | -3.0 |
| exp | Exponential | exp(1.0) | 2.71828… |
| log | Logarithm | log(4.0) | 1.38629… |
| | log(4.0, 2.0) | 2.0 |
| log10 | Base-10 logarithm | log10(4.0) | 0.60205… |
| pow | Power | pow(2.5, 2.0) | 6.25 |
| sqrt | Square root | sqrt(9.0) | 3.0 |
| cos | Cosine | cos(pi) | -1.0 |
| asin | Arc sine | asin(-1.0) | -1.5707… |
| hypot | Euclidean norm x2 + y2 | hypot(2, 3) | 3.60555… |
| degrees | Convert from radians to degrees | degrees(pi) | 180 |
| radians | Convert from degrees to radians | radians(45) | 0.78539… |
Table 8.1: The Python Standard Math Library
- All the other trigonometric functions (
tan, arctan, etc.) are also there abs is built in to the language: don't have to import anything to use it
- Comment on this slide
The System Library
- Most commonly used library in Python is the system library
sys
- Information about Python (e.g., version number and copyright notice)
- Information about the environment (e.g., what operating system the program is running on)
- Advanced features that mere mortals should never meddle with
- By convention, always imported with
import sys
- Members referred to as
sys.x, rather than just x
- Contents
| Type | Name | Purpose | Example | Result |
|---|
| Data | argv | The program's command line arguments | argv[0] | "myscript.py" (or whatever your program is called) |
| maxint | Largest positive value that can be represented by Python's basic integer type | sys.maxint | 2147483647 |
| path | List of directories that Python searches when importing modules | sys.path | ['/home/greg/pylib', '/Python24/lib', '/Python24/lib/site-packages'] |
| platform | What type of operating system Python is running on | sys.platform | "win32" |
| stdin | Standard input | sys.stdin.readline() | (Typically) the next line of input from the keyboard |
| stdout | Standard output | sys.stdout.write('****') | (Typically) print four stars to the screen |
| stderr | Standard error | sys.stderr.write('Program crashing!\n') | Print an error message to the screen |
| version | What version of Python this is | sys.version | "2.4 (#60, Feb 9 2005, 19:03:27) [MSC v.1310 32 bit (Intel)]" |
| Function | exit | Exit from Python, returning a status code to the operating system | sys.exit(0) | Terminates program with status 0 |
Table 8.2: The Python System Library
sys.argv is automatically initialized to the program's command-line arguments- Program's name is always
sys.argv[0] - Example:
python program.py -d noisy gets ['program.py', '-d', 'noisy']
sys.path is the list of places Python is allowed to look to find modules for import- Initialized from the
PYTHONPATH environment variable- Directory containing the program being run is automatically put at the start of this list
- If
sys.path is ['/home/swc/exmpl/py03', '/home/swc/lib', '/Python24/lib'], then import quark will look for:
/home/swc/exmpl/py03/quark.py/home/swc/lib/quark.py/Python24/lib/quark.py- then fail
sys.stdin and sys.stdout are normally connected to the keyboard and screen- If you redirect input, or are using a pipe,
sys.stdin reads from a file or another program- In
python reader.py < somefile.txt, reader.py's sys.stdin reads from somefile.txt
- If you redirect output, or are using a pipe,
sys.stdout writes to a file or another programpython writer.py | sort | uniq sorts the output of writer.py and removes duplicate lines
sys.exit terminates the program, and returns an integer status code to the operating system- 0 indicates successful execution (“zero errors”)
- Non-zero is an error code
- Please be polite and return 1 or -1 if you detect an error in your program
- Yes, it's the opposite of what you'd expect…
- If you don't exit explicitly, Python returns 0
- Comment on this slide
Times
- Can't delete old files, or check how long your program has been running, without some notion of time
- Terminology:
- The epoch is the moment from which time is measured. On Unix, the epoch is midnight, January 1, 1970; time is measured in seconds since then.
- Coordinated Universal Time, or UTC, is the official time planet Earth. It used to be called Greenwich Mean Time (GMT).
- Local time is what the clock on your wall shows. It is defined to be UTC, plus or minus a constant offset determined by your time zone, possibly modified by daylight savings time.
- A leap year is a year that has an extra day (to keep the calendar in synch with Earth's orbit around the sun)
- A leap second is an extra second added to a day to keep UTC in synch with Earth's rotation
- Sometimes, two leap seconds have to be added.
- A time structure is an object with nine fields, representing a time in a more comprehensible format.
tm_year: year (e.g., 2005)tm_mon: month (integer in the range [1, 12])tm_mday: day of the month (integer in the range [1, 31])tm_hour: hour of the day (integer in the range [0, 23])tm_min: minute of the hour (integer in the range [0, 59])tm_sec: second of the minute (integer in the range [0, 61])tm_wday: week day (integer in the range [0, 6], with Monday being 0)tm_yday: day of the year (integer in the range [1, 366])tm_isdst: daylight savings time flag
- Contents of
time module (presented out of order so that they'll make more sense):
| Type | Name | Purpose | Example | Result |
|---|
| Function | time | The current time, in seconds since the epoch. | time.time() | 1111941637.0929999 (on March 27, 2005, at 11:40:39 EST) |
| gmtime | Convert a time (in seconds since the epoch) to a time structure representing the time in UTC. | time.gmtime(1111941637) | A structure with s.tm_year equal to 2005, s.tm_hour equal to 16, etc. |
| localtime | Like gmtime, but returns the local time. | time.localtime(1111941637) | A structure with s.tm_hou equal to 11 instead of 16. |
| ctime | Convert a time (in seconds since the epoch) to a readable string in local time. | time.ctime(1111941637) | "Sun Mar 27 11:40:37 2005" |
| asctime | Convert a time structure to a readable string in UTC. | time.asctime(time.gmtime(1111941637)) | "Sun Mar 27 16:40:37 2005" |
| mktime | Convert a time structure (or a 9-element tuple) to seconds since the epoch in the local time zone. | time.mktime((1970, 1, 1, 0, 0, 0, 0, 0, 0)) | 18000.0 (in Eastern Standard Time) |
| sleep | Pause the program for the specified number of seconds. | time.sleep(5) | Nothing, for five seconds. |
Table 8.3: The Python Time Library
- The names and behavior of these functions are taken from the standard C library
- Less consistent than one might want
- For example, no equivalent of
time.mktime for converting to UTC
- Always store and manipulate times in UTC
- Especially important in a world in which the telescope could be in Hawaii, and the database in Paris.
- Do not try to write calendar calculations yourself
- It's a lot harder than you think
- Use the
datetime and calendar modules
- Comment on this slide
Working with the File System
- The
os module contains a lot of functions for interacting with the operating system- The table below is only a fraction of them
| Type | Name | Purpose | Example | Result |
|---|
| Constant | curdir | The symbolic name for the current directory. | os.curdir | . on Linux. |
| pardir | The symbolic name for the parent directory. | os.pardir | .. on Linux. |
| sep | The separator character used in paths. | os.sep | / on Linux, \ on Windows. |
| linesep | The end-of-line marker used in text files. | os.linesep | \n on Linux, \r\n on Windows. |
| Function | listdir | List the contents of a directory. | os.listdir('/tmp') | The names of all the files and directories in /tmp (except . and ..). |
| mkdir | Create a new directory. | os.mkdir('/tmp/scratch') | Make the directory /tmp/scratch. Use os.makedirs to make several directories at once. |
| remove | Delete a file. | os.remove('/tmp/workingfile.txt') | Delete the file /tmp/workingfile.txt. |
| rename | Rename (or move) a file or directory. | os.rename('/tmp/scratch.txt', '/home/swc/data/important.txt') | Move the file /tmp/scratch.txt to /home/swc/data/important.txt. |
| rmdir | Remove a directory. | os.rmdir('/home/swc') | Probably not something you want to do… Use os.removedirs to remove several directories at once. |
| stat | Get information about a file or directory. | os.stat('/home/swc/data/important.txt') | Find out when important.txt was created, how large it is, etc. |
Table 8.4: The Python File System Library
- Note:
os.stat returns an object with at least ten members, the most important of which are:
st_size: size in bytesst_atime: time of most recent accessst_mtime: time of most recent modification- These times are in seconds since the epoch
import time, os
info = os.stat('stat_file.py')
print 'size:', info.st_size
print 'access time:', time.ctime(info.st_atime)
print 'modification time:', time.ctime(info.st_mtime)
size: 177
access time: Sat Jul 2 07:16:46 2005
modification time: Mon Jun 13 09:31:10 2005
- Comment on this slide
Manipulating Pathnames
- The
os module has a submodule called os.path
- Manipulate pathnames (filenames and directory names) correctly and efficiently
- Do not write your own functions for this—the rules are trickier than you think
- Used at least as often as
os itself
- Contents:
| Type | Name | Purpose | Example | Result |
|---|
| Function | abspath | Create normalized absolute pathnames. | os.path.abspath('../jeevan/bin/script.py') | /home/jeevan/bin/script.py (if executed in /home/gvwilson) |
| basename | Return the last portion of a path (i.e., the filename, or the last directory name). | os.path.basename('/tmp/scratch/junk.data') | junk.data |
| dirname | Return all but the last portion of a path. | os.path.basename('/tmp/scratch/junk.data') | /tmp/scratch |
| exists | Return True if a pathname refers to an existing file or directory. | os.path.exists('./scribble.txt') | True if there is a file called scribble.txt in the current working directory, False otherwise. |
| getatime | Get the last access time of a file or directory (like os.stat). | os.path.getatime('.') | 1112109573 (which means that the current directory was last read or written at 10:19:33 EST on March 29, 2005). |
| getmtime | Get the last modification time of a file or directory (like os.stat). | os.path.getmtime('.') | 1112109502 (which means that the current directory was last modified 71 seconds before the time shown above). |
| getsize | Get the size of something in bytes (like os.stat). | os.path.getsize('py03.swc') | 29662. |
| isabs | True if its argument is an absolute pathname. | os.path.isabs('tmp/data.txt') | False |
| isfile | True if its argument identifies an existing file. | os.path.isfile('tmp/data.txt') | True if a file called ./tmp/data.txt exists, and False otherwise. |
| isdir | True if its argument identifies an existing directory.. | os.path.isdir('tmp') | True if the current directory has a subdirectory called tmp. |
| join | Join pathname fragments to create a full pathname. | os.path.join('/tmp', 'scratch', 'data.txt') | "/tmp/scratch/data.txt" |
| normpath | Normalize a pathname (i.e., remove redundant slashes, uses of . and .., etc.). | os.path.normpath('tmp/scratch/../other/file.txt') | "tmp/other/file.txt" |
| split | Return both of the values returned by os.path.dirname and os.path.basename. | os.path.split('/tmp/scratch.dat') | ('/tmp', 'scratch.dat') |
| splitext | Split a path into two pieces root and ext, such that ext is the last piece beginning with a ".". | os.path.splitext('/tmp/scratch.dat') | ('/tmp/scratch', '.dat') |
Table 8.5: Python's Pathname Library
- Comment on this slide
Knowing Where You Are
- Python assigns the module's name to the special variable
__name__
"__main__" when the file is run from the command line- The module's name when it is loaded by something else
- Often used to draw attention to the main body of a large program
- Or to include self-tests in module
- If the module is loaded by something else, its name isn't
"__main__", so the tests aren't run
- Comment on this slide
Where to Learn More
Exercises
Exercise 8.1:
Write a function that takes two strings called text and
fragment as arguments, and returns the number of times
fragment appears in the second half of text. Your
function must not create a copy of the second half of
text. (Hint: read the documentation for string.count.)
Exercise 8.2:
What does the Python keyword global do?
What are some reasons not to write code that uses it?
Exercise 8.3:
Consider the following sample of code and its output:
def settings(first, **rest):
print 'first is', first
print 'rest is'
for (name, value) in rest.items():
print '...', name, value
print
settings(1)
settings(1, two=2, three="THREE")
first is 1
rest is
first is 1
rest is
... two 2
... three THREE
What does the variable rest do? What does the double
asterisk ** in front of its name mean? How does it compare
to the example with *extra (with a single asterisk) in the
lecture?
Exercise 8.4:
Python allows you to import all the functions and variables in a
module at once, making them local name. For example, if the
module is called values, and contains a variable called
Threshold and a function called limit, then after
the statement from values import *, you can then refer
directly to Threshold and limit, rather than having
to use values.Threshold or values.limit. Explain
why this is generally considered a bad thing to do, even though it
reduces the amount programmers have to type.
Exercise 8.5:
sys.stdin, sys.stdout, and sys.stderr are
variables, which means that you can assign to them. For example,
if you want to change where print sends its output, you can
do this:
import sys
print 'this goes to stdout'
temp = sys.stdout
sys.stdout = open('temporary.txt', 'w')
print 'this goes to temporary.txt'
sys.stdout = temp
Do you think this is a good programming practice? When and why
do you think its use might be justified?
Exercise 8.6:
os.stat(path) returns an object whose members describe
various properties of the file or directory identified by
path. Using this, write a function that will determine
whether or not a file is more than one year old.
Exercise 8.7:
Write a Python program that takes as its arguments two years (such
as 1997 and 2007), prints out the number of days between the 15th
of each month from January of the first year until December of the
last year.
Exercise 8.8:
Write a simple version of which in Python.
Your program should check each directory on the caller's path
(in order) to find an executable program that has the name given
to it on the command line.
Testing Basics
Motivation
- Testing software is just as important as calibrating experimental equipment, or checking the purity of reagants
- But it's done far less often
- Reason #1: people don't know how
- Reason #2: low standards can perpetuate themselves until someone decides to make quality an issue
- American cars in the 1970s
- Study after study has proved that the more time you spend testing, the less total time it takes to develop working software
- So long as you know how to test systematically…
- …and use the right tools to help you…
- …and design your software to be testable
- Studies also show that how well a team tests is a good predictor of how good they are at meeting schedule and quality targets
- Because discipline matters more than genius
- Comment on this slide
Terminology
- Unit testing (or component testing): testing one component in isolation
- E.g., test that all of a class's methods do what they're supposed to
- System testing (or integration testing): testing that components work properly in combination
- E.g., make sure that when X calls Y, it passes parameters in the right order
- These are two ends of a spectrum
- Most classes rely on several others, and so cannot be tested in isolation
- Often pursue a bottom-up testing strategy
- Test the simple classes first, then the ones that use them, then the ones that use those
- Often makes sense to test subsystems (made up of several classes), rather than the entire program
- E.g., test the voice recognition module
- Sometimes have to build scaffolding to support testing
- Like the scaffolding on a building site, it's extra code that is used during the construction process, but not delivered to customers
- Regression testing: rerunning tests periodically to see whether the code has regressed
- I.e., whether things that used to work are now broken
- Every time a change is made to the system, the programmer making the change should rerun the regression tests before checking her changes into version control
- Never slow other people down by breaking things that they're relying on
- A large program that doesn't have regression tests is difficult (sometimes impossible) to maintain
- The only way to know whether a change to one part will break something somewhere else is to run the whole program
- A fixture is a thing to be tested
- E.g., an object, a data structure, a complete database
- A test is something you do to a fixture
- A test result can be one of:
- Pass: the actual outcome is what was expected
- Fail: the actual outcome is different from what was expected
- Error: something went wrong with the test itself (i.e., the test contains a bug)
- Don't know anything about the system being tested
- Comment on this slide
Example: Rectangle Overlap
- Want to test a function
overlap that calculates the overlap between two rectangles- Each rectangle is represented by an instance of the class
Rect
- Constructor takes four non-negative integer arguments (low and high XY coordinates)
- Class has methods
getLoX, getLoY, getHiX, and getHiY - Rectangle is guaranteed to have non-zero area
- Function returns:
- A new rectangle representing the overlap of its inputs if they overlap
None if there's no overlap- Note: since rectangles have to have non-zero area, touching on an edge or corner does not count as overlap
- Analysis
- This is unit testing (one thing in isolation)
- Assume for the moment that
Rect is correct - I.e., that it has been tested elsewhere
- Each fixture will be a pair of rectangles
- The test will be to pass them to
overlap, and see if the output is correct
- First version of
overlap:
def overlap(left, right):
'''Calculate overlap between two rectangles.'''
loX = max(left.getLoX(), right.getLoX())
hiX = min(left.getHiX(), right.getHiX())
loY = max(left.getLoY(), right.getLoY())
hiY = min(left.getHiY(), right.getHiY())
if (loX >= hiX) or (loY > hiY):
return None
return Rect(loX, loY, loX, hiY)
- How many errors can you spot?
- First test passes:
r1 = Rect(0, 0, 1, 1)
r2 = Rect(2, 2, 3, 3)
result = overlap(r1, r2)
if result == None:
print 'pass'
else:
print 'fail'
pass
- But not the second one:
r1 = Rect(0, 0, 1, 1)
r2 = Rect(0, 1, 3, 3)
result = overlap(r1, r2)
if result == None:
print 'pass'
else:
print 'fail'
Traceback (most recent call last):
File "overlap.py", line 97, in ?
result = overlap(r1, r2)
File "overlap.py", line 64, in overlap
return Rect(loX, loY, hiX, hiY)
File "overlap.py", line 9, in __init__
assert 0 <= loY < hiY, 'illegal Y coordinates'
AssertionError: illegal Y coordinates
- This is an error in the test
- A good test will always run, no matter what state the code is in
- Revise the test and rerun:
try:
r1 = Rect(0, 0, 1, 1)
r2 = Rect(0, 1, 3, 3)
result = overlap(r1, r2)
if result == None:
print 'pass'
else:
print 'fail'
except Exception, e:
print 'error', e
pass
- Why is this change important?
- Need to be able to run all tests, all the time
- If the first test raises an exception that isn't handled, the other ninety-nine tests won't run at all
- So, what's the problem?
- Passing
loX to Rect's constructor twice, rather than loX and hiX - Fix and rerun tests
- Testing
loY > hiY instead of loY >= hiY - Fix and rerun tests
- We ought to test that
overlap works when rectangles do overlap try:
r1 = Rect(0, 0, 3, 3)
r2 = Rect(1, 1, 2, 2)
result = overlap(r1, r2)
if result == Rect(1, 1, 2, 2):
print 'pass'
else:
print 'fail'
except Exception, e:
print 'error', e
error name 'overlap' is not defined
- This is only one case out of many
![[Possible Overlap Tests]](img/test/possible_overlaps.png)
Figure 9.1: Possible Overlap Tests |
- How many tests should we write? When should we stop?
- Comment on this slide
General Rules for Unit Tests
- Make tests independent of each other
- No interaction between tests
- Nothing one test does can affect any other test
- In particular, the order in which tests are run shouldn't matter
- If two tests depend on one another, then when the first fails, the second either:
- Isn't run (so you have less information), or
- Runs, but gives you the wrong answer (which is even worse than no information)
- Keep tests small
- Each tests exactly one thing
- When a test fails, you'll know the cause right away
- Don't hard-code absolute path names, magic numbers, etc.
- Want to be able to move tests from machine to machine
- Have to b able to maintain tests as code evolves
- Test code should be at least as well written as production code
- Comment on this slide
A Simple Testing Framework
- Step back a moment and think about how we're writing our tests
- Each has two rectangles as input, and expects either a rectangle or
None - Each reports passes, fails, or errors
- Design principle: every time you see a pattern, try to capture it in code
- Write the shared code once
- Encourage people to do things the right way by making that way easiest
- Step 1: write a function that runs a test and reports its result
def runTest(left, right, expected):
try:
assert left is not None
assert right is not None
actual = overlap(left, right)
if actual == expected:
return 'pass'
else:
return 'fail'
except Exception, e:
return 'error'
- Just as easy to return strings as to define constants for pass, fail, and error
- Note: assertions on input arguments are inside the
try/except
- Passing
None as the left or right rectangle is an error in the test
- Step 2: write some tests
- Whenever possible, create a table of inputs and results
- Easier for people to read, understand, and add to than code
- Especially if those people aren't programmers
def runTest(left, right, expected):
try:
assert left is not None
assert right is not None
actual = overlap(left, right)
if actual == expected:
return 'pass'
else:
return 'fail'
except Exception, e:
return 'error'
- We can make this even more flexible by writing a function that tests an arbitrary function
- Remember that
def f(*args) creates a function that takes any number of arguments - If you have a tuple of values called
vals, then f(*vals) matches the values with f's arguments- E.g., if
vals is (11, 12, 13), then f(*vals) is the same as f(11, 12, 13) - Gives you a way to call an arbitrary function with an arbitrary set of arguments in a uniform way
- Use this to write a function
runAllTests that takes two arguments:
- The function you actually want to test
- A list of lists, each of which consists of zero or more arguments, and the expected result
def runAllTests(func, tests):
pass = fail = error = 0
for testCase in tests:
try:
args, expected = testCase[:-1], testCase[-1]
actual = func(*args)
if actual == expected:
pass += 1
else:
fail += 1
except Exception, e:
error += 1
for (label, count) in (('pass', pass), ('fail', fail), ('error', error)):
print '%s: %4d' % (label, count)
runAllTests(overlap, tests)
- Advantage: reusability
- Can use this to test any function
- Disadvantage: obscurity
- The extra level of abstraction makes it harder to understand what's going on
- But you only have to understand one hard thing once, instead of many slightly less hard things over and over
- Take a look in the next lecture at a full-fledged unit testing framework
- Comment on this slide
Choosing Test Cases
- You can't test everything
- There are roughly 10 million 7-digit phone numbers
- A function that puts two phone numbers in order has 1014 possible inputs
- At a million tests a second, it will take 108 seconds (or roughly three years) to check them all
- And then we move on to the second function…
- Faster computers won't get us out of this hole
- How many possible inputs are there to a function that sorts a list of strings?
- Human beings are creatures of habit
- Tend to make the same kinds of errors over and over again
- So test for those first
- Has an interesting side effect: once you start testing for habitual errors, programmers become more conscious of them, and make them less often
- The first tests you should write:
- Tests you expect to succeed
- Boundary cases (e.g., sort the empty list, or a list of one value)
- Simplest interesting case (e.g., sort a list of two values)
- General case (e.g., sort a list of nine values)
- If duplicate values are allowed, make sure you test with them
- Tests you expect to fail
- Invalid input (e.g., passed a dictionary instead of a list)
- Remember, error handling is part of the interface too
- Sanity tests
- Make sure data structures remain consistent
- If there is redundant information, check it against itself
- A catalog of errors
- Numbers: zero, largest, smallest magnitude, most negative
- Structures: empty, exactly one element, maximum number of elements
- Duplicate elements (e.g., the letter
"J" appears three times in a string) - Aliased elements (e.g., a list contains two references to another list)
- Circular structures (e.g., a list that contains a reference to itself)
- Searching: no match found, one match found, multiple matches found, everything matches
- Code like
x = findAll(structure)[0] is almost always wrong - Should also check aliased matches (same thing found multiple times)
- Comment on this slide
Dictionaries and Error Handling
Motivation
- Problem: you want to count how often different people send messages to a mailing list
- Input: a file with one email address per line
- Each address may appear many times
- Output: a table of addresses, sorted by frequency of appearance
- Solution: keep track of counts in a list of pairs
- Each entry in the list is
[address, countSoFar] - Each time an address is read in, search the list, and either:
- Increment the count of an existing entry, or
- Create a new entry with a count of 1
- Once everything has been read, sort the entries
- Note: you can pass your own comparison function to
list.sort
import sys
def countFind(counts, address):
for x in counts:
if x[0] == address:
return x
return None
def countFreq(lines):
counts = []
for line in lines:
line = line.strip()
existing = countFind(counts, line)
if existing:
existing[1] += 1
else:
counts.append([line, 1])
return counts
def compareByFrequency(a, b):
if a[1] < b[1]:
return 1
if a[1] == b[1]:
return 0
return -1
if __name__ == '__main__':
instream = open(sys.argv[1], 'r')
lines = instream.readlines()
instream.close()
result = countFreq(lines)
result.sort(compareByFrequency)
for address, frequency in result:
print frequency, ":", address
- This is clumsy and inefficient
- Clumsy: writing a lot of code to do something that ought to be simple
- Inefficient: repeatedly searching the entire list
- There's an easier way…
- Comment on this slide
String Formatting
- But before we start, let's tidy up a few loose ends
- Python's
% operator is used to format strings- Left side is a string specifying how things are to be formatted
- Right side is the values to be formatted
- One value on its own…
- …or a tuple if there are many values
'here %s go' % 'we' creates a new string "here we go"
- The
"%s" in the left string means “insert a string here” - So the string on the right is inserted
- Since strings are immutable, this creates a new string (does not modify the format string in place)
'left %d right %d' % (-1, 1) creates "left -1 right 1"
"%d" stands for “decimal integer”- Format specifiers are matched with values left to right
'do %s %d times' % ('this', 99) creates "do this 99 times"
'[%4d]' % 13 creates "[ 13]"
- The integer between the
"%" and "d" specifies a width '[%-4d]' % 13 creates "[13 ]", because negative widths mean “left justify”
'[%6.2f %%]' % 37.2 creates "[ 37.20 %]"
"%6.2f" means “a floating point number, six characters wide, with two after the decimal point”"%%" is translated into a single "%" character- Just as
"\\" is how you represent a single "\" in a string
'[%6.2e %%]' % 37.2 creates "[3.72e+001 %]"
"%e" is scientific (exponential) notation- Python will use more characters than it's been told to if it has to
- Comment on this slide
Dictionaries
- A sequence (such as a string or list) associates one value with each integer from 0 to N-1
- A dictionary associates one value with each of its keys
- Also called maps, hashes, and associative arrays
- Keys don't have to be integers
- In fact, they are usually strings
- Often visualized as a two-column table
- Each key in the left column must be unique
- Values can occur several times
- Example planet names as keys, and the lengths of their years as values
| Mercury | 87.97 |
| Venus | 224.70 |
| ⋮ | ⋮ |
| Pluto | 90550.0 |
- Comment on this slide
The Mechanics
- Create a dictionary by putting key/value pairs inside
{}
{'Newton':1642, 'Darwin':1809}- Empty dictionary written
{} - Get a value from an existing dictionary using
[] (just like everything else in Python) birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
print "Darwin's birthday:", birthday['Darwin']
print "Newton's birthday:", birthday['Newton']
Darwin's birthday: 1809
Newton's birthday: 1642
- Assigning to a dictionary key:
- Creates a new entry if the key is not already in dictionary
- Overwrites the previous value if the key is already present
birthday = {}
birthday['Darwin'] = 1809
birthday['Newton'] = 1942 # oops
birthday['Newton'] = 1642
print birthday
{'Darwin': 1809, 'Newton': 1642}
- Can only access keys that are present
- Just as you can't index elements of a list that aren't there
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
print birthday['Turing']
Traceback (most recent call last):
File "key_error.py", line 5, in ?
print birthday['Turing']
KeyError: 'Turing'
- Remove an entry using
del d[k]
- Can only remove entries that are actually present
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
print 'Before deleting Turing:', birthday
del birthday['Turing']
print 'After deleting Turing:', birthday
Before deleting Turing: {'Turing': 1912, 'Newton': 1642, 'Darwin': 1809}
After deleting Turing: {'Newton': 1642, 'Darwin': 1809}
for k in d loops over the dictionary's keys (rather than its values)
- Note: this is different from lists, where
for loops over the values, rather than the “keys” (indices) birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
for name in birthday:
print name, birthday[name]
Turing 1912
Newton 1642
Darwin 1809
- Test whether a key
k is in a dictionary d using k in d
- Once again, inconsistent with behavior of lists, but useful
birthday = {
'Newton' : 1642,
'Darwin' : 1809
}
for name in ['Newton', 'Turing']:
if name in birthday:
print name, birthday[name]
else:
print 'Who is', name, '?'
Newton 1642
Who is Turing ?
- Comment on this slide
Dictionary Methods
- Yes, dictionaries are objects too…
| Method | Purpose | Example | Result |
|---|
clear | Empty the dictionary. | d.clear() | Returns None, but d is now empty. |
get | Return the value associated with a key, or a default value if the key is not present. | d.get('x', 99) | Returns d['x'] if "x" is in d, or 99 if it is not. |
keys | Return the dictionary's keys as a list. Entries are guaranteed to be unique. | birthday.keys() | ['Turing', 'Newton', 'Darwin'] |
items | Return a list of (key, value) pairs. | birthday.items() | [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)] |
values | Return the dictionary's values as a list. Entries may or may not be unique. | birthday.values() | [1912, 1642, 1809] |
update | Copy keys and values from one dictionary into another. | See the example below. | |
Table 10.1: Dictionary Methods
- Example:
birthday = {
'Newton' : 1642,
'Darwin' : 1809,
'Turing' : 1912
}
print 'keys:', birthday.keys()
print 'values:', birthday.values()
print 'items:', birthday.items()
print 'get:', birthday.get('Curie', 1867)
temp = {
'Curie' : 1867,
'Hopper' : 1906,
'Franklin' : 1920
}
birthday.update(temp)
print 'after update:', birthday
birthday.clear()
print 'after clear:', birthday
keys: ['Turing', 'Newton', 'Darwin']
values: [1912, 1642, 1809]
items: [('Turing', 1912), ('Newton', 1642), ('Darwin', 1809)]
get: 1867
after update: {'Curie': 1867, 'Darwin': 1809, 'Franklin': 1920, 'Turing': 1912, 'Newton': 1642, 'Hopper': 1906}
after clear: {}
- Comment on this slide
Counting Frequency
- Revisit the problem of counting how often things appear in a file
- Use the things (email addresses) as keys in a dictionary
- The value associated with each key is the number of times it has been seen so far
# Data to count.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
# Build a dictionary of frequencies.
freq = {}
for name in names:
# Already seen, so increment count by one.
if name in freq:
freq[name] = freq[name] + 1
# Never seen before, so add to dictionary.
else:
freq[name] = 1
# Display.
print freq
{'Be': 3, 'Mg': 3, 'Ca': 1}
![[Counting Frequency]](img/py04/freq_count.png)
Figure 10.1: Counting Frequency |
- Note: dictionaries do not store keys in order
- No way to predict where a dictionary will insert a new value
- Paradoxically, this is part of what makes dictionaries so fast
- So, to print frequency in alphabetic order by address:
- Get the list of keys
- Sort that list
- Loop over it
# Build the frequency dictionary as before.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
if name in freq:
freq[name] = freq[name] + 1
else:
freq[name] = 1
# Print in alphabetical order by key.
keys = freq.keys()
keys.sort()
for k in keys:
print k, freq[k]
Be 3
Ca 1
Mg 3
- Can simplify this code using
dict.get
- Get either the count associated with the key, or 0, then add one to it
# Use 'get' to simplify the counting loop.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
freq[name] = freq.get(name, 0) + 1
# Print.
keys = freq.keys()
keys.sort()
for k in keys:
print k, freq[k]
Be 3
Ca 1
Mg 3
- But how to print in order of frequency?
- Need to invert the dictionary (i.e., swap keys and values)
- Problem: there might be collisions, since values aren't guaranteed to be unique
- What is the inverse of
{'a':1, 'b':1, 'c':1}?
- Solution: store a list of values instead of just a single value
# Count values as before.
names = ['Be','Mg','Mg','Ca','Be','Mg', 'Be']
freq = {}
for name in names:
freq[name] = freq.get(name, 0) + 1
# Invert.
inverse = {}
for (key, value) in freq.items():
seen = inverse.get(value, [])
seen.append(key)
inverse[value] = seen
# Print.
keys = inverse.keys()
keys.sort()
for k in keys:
print k, inverse[k]
1 ['Ca']
3 ['Be', 'Mg']
![[Tracing Execution of Dictionary Inversion]](img/py04/dict_count_inverse_trace.png)
Figure 10.2: Tracing Execution of Dictionary Inversion |
- Comment on this slide
Formatting Strings with Dictionaries
- Complex string formatting can be hard to understand
- Especially if one value needs to be used several times
- E.g., producing an email form letter, and need to put the person's name in three places
- Instead of a tuple,
"%" can take a dictionary as its right argument- Use
"%(varname)s" or "%(varname)4d" inside format string to identify what's to be substituted '%(word)s, t%(word)s, and everyw%(word)s' % {'word' : 'here'} creates "here, there, and everywhere"
- Comment on this slide
Catching Errors
- Like other modern languages, Python uses exceptions to handle errors
- Separating normal (expected) operation from exceptional cases makes code easier to read
- Structured like
if/else
- Introduce a block of statements with
try - Then introduce error handling code with
except - When something goes wrong in the
try block, Python raises an exception - Python finds the matching
except block, and executes whatever instructions it contains print 'before try/except'
try:
print '..first line of try block'
print 1/0
print '..last line of try block'
except:
print '..handling error!'
print 'after try/except'
before try/except
..first line of try block
..handling error!
after try/except
![[Flow of Control in Try/Except]](img/py04/simple_try_except.png)
Figure 10.3: Flow of Control in Try/Except |
- You can include code to execute when there isn't an error using
else
for val in (0.0, 1.0):
print "val is", val
try:
x = 1.0/val
except:
print '..handling error!'
else:
print '...in the else block'
val is 0.0
..handling error!
val is 1.0
...in the else block
- Comment on this slide
Exception Objects
- When an error occurs, Python actually creates an object to store information about it
- Typically an error message
- Can choose to handle specific errors by specifying an exception type in the
except statement- E.g., handle division by zero, but not out-of-bounds list index
- Syntax is
except ExceptionType, variable
ExceptionType is what you want to catch- If something of that type is raised, the exception object is assigned to
variable
values = [-1.0, 0.0, 1.0]
for i in range(4): # note: top index will be out of bounds
try:
x = 1.0 / values[i]
except ZeroDivisionError, e:
print 'divide by zero:', e
divide by zero: float division
Traceback (most recent call last):
File "except_with_type.py", line 4, in ?
x = 1.0 / values[i]
IndexError: list index out of range
- You can associate any number of
except blocks with a single try
- They're tested in order—whichever matches first, wins
- If a “naked”
except appears, it must be the last one (since it catches everything)
- Better to use
except Exception, e so that you have the exception object
- Exceptions are organized in a hierarchy
- E.g.,
ZeroDivisionError, OverflowError, and FloatingPointError are all specialized versions of ArithmeticError - And
IndexError (list/string index out of bounds) and KeyError (non-existent dictionary key) are specializations of LookupError for i in range(2):
try:
if i == 0:
vals = ['a', 'b']
x = vals[99]
else:
vals = {20 : 'a', 30 : 'b'}
x = vals[40]
except KeyError, e:
print 'loop %d, handling key errors:' % i, e
except LookupError, e:
print 'loop %d, handling generic lookup errors:' % i, e
loop 0, handling generic lookup errors: list index out of range
loop 1, handling key errors: 40
- We'll see in a later lecture how this hierarchy is implemented
- Hint: it has something to do with objects…
- Common exception types
| Name | Purpose |
|---|
Exception | Root of exception hierarchy. |
ArithmeticError | Illegal arithmetic operation |
IndexError | Bad index to sequence (out of bounds or illegal type) |
KeyError | Nonexistent index to dictionary |
TypeError | Illegal type (e.g., trying to add integer and string) |
ValueError | Illegal value (e.g., math.sqrt(-1)) |
IOError | Unable to create or open file, read data, etc. |
Table 10.2: Common Exception Types
- Comment on this slide
Functions and Exceptions
- So if a function raises an exception, who gets to handle it?
- Each time Python enters a
try/except block, it pushes the except handlers on a stack- Just like the function call stack
- When an exception is raised, Python look in this stack, and jumps to the first matching handler
def invert(vals, index):
try:
vals[index] = 1/vals[index]
except ArithmeticError, e:
print 'inner exception handler:', e
def each(func, vals, indices):
try:
for i in indices:
func(vals, i)
except IndexError, e:
print 'outer exception handler:', e
# Note: top index will be out of bounds
allValues = [-1, 0, 1]
allIndices = [0, 1, 2, 3]
each(invert, allValues, allIndices)
calling func for index 0
calling func for index 1
inner exception handler: integer division or modulo by zero
calling func for index 2
calling func for index 3
outer exception handler: list index out of range
![[Stacking Exception Handlers]](img/py04/exception_stack.png)
Figure 10.4: Stacking Exception Handlers |
- Comment on this slide
Raising Exceptions
- Use
raise to trigger exception processing- Obviously, have to specify the type of exception you're raising
- E.g.,
raise Exception('this is an error message')
- Please make your error messages more informative…
for i in range(4):
try:
# Raise exception if value is odd.
if (i % 2) == 1:
raise ValueError('index is odd')
else:
print 'not raising exception for %d' % i
except ValueError, e:
print 'caught exception for %d' % i, e
not raising exception for 0
caught exception for 1 index is odd
not raising exception for 2
caught exception for 3 index is odd
- Raise exceptions to report errors in functions
- Rather than returning
None or -1 or something like that list.find (which returns -1 if something can't be found) is unusual in this respect- But it's more convenient than
list.index, which throws an exception
- General rule: throw low, catch high
- I.e., throw lots of very specific exceptions, but only catch them in a few places, high in your code, where you can take corrective action
- Reason #1: every application handles errors differently
- If someone is using your library in a GUI, you don't want to be printing to
stderr
- Reason #2: error handling code gets in the way of understanding the algorithm, so disentangle the two
import sys
# Look for the first matching line in a file.
def findFirstMatchingLine(filename, word):
infile = open(filename, 'r')
lines = infile.readlines()
infile.close()
for line in lines:
if line.find(word) >= 0:
return line.rstrip()
errMsg = '%s does not contain %s' % (filename, word)
raise ValueError(errMsg)
if __name__ == '__main__':
try:
word = sys.argv[1]
for filename in sys.argv[2:]:
line = findFirstMatchingLine(filename, word)
print '%s / %s => %s' % (filename, word, line)
except ValueError, e:
print >> sys.stderr, e
except IOError, e:
print >> sys.stderr, e
except Exception, e:
print >> sys.stderr, e
import sys
# Look for the first matching line in a file.
def findFirstMatchingLine(filename, word):
infile = open(filename, 'r')
lines = infile.readlines()
infile.close()
for line in lines:
if line.find(word) >= 0:
return line.rstrip()
errMsg = '%s does not contain %s' % (filename, word)
raise ValueError(errMsg)
if __name__ == '__main__':
try:
word = sys.argv[1]
for filename in sys.argv[2:]:
line = findFirstMatchingLine(filename, word)
print '%s / %s => %s' % (filename, word, line)
except ValueError, e:
print >> sys.stderr, e
except IOError, e:
print >> sys.stderr, e
except Exception, e:
print >> sys.stderr, e
- Comment on this slide
Assertions
- An assertion is a claim about the state of a program that must be true if the program is correct
- E.g., “this list is not empty”, or “this value is greater than that one”
- Three kinds of assertions:
- A pre-condition is something that must be true in order for something else to work
- E.g., “argument is non-negative” is a pre-condition for
math.sqrt
- A post-condition is something that is guaranteed to be true after something else happens
- E.g., “result is either -1 or a legal index to the string” is a post-condition of
string.find
- Anything else is simply called a state assertion
- E.g., “this loop always executes at least once”
- Embed assertions in programs using the
assert statement- First argument is a Boolean condition
- Second (optional) is an error message
- If the condition isn't satisfied, Python raises an
AssertionError exception def findRange(values):
assert (values != None) and (len(values) > 0), 'Illegal input'
left = min(values)
right = max(values)
assert left < right, 'Empty range'
return (left, right)
- Good programmers practice defensive programming by putting lots of assertions in their code
- “Executable documentation”
- Describes how the program is supposed to behave, in a way that the computer can check
- Can't ever be out of step with the actual code, since it is code
- Fail early, fail often
- The less time that elapses between when an error occurs, and when its effects show up, the easier it is to fix
- Bugs grow up to be assertions
- If you made an error once, there's a good chance you'll make it again
- Adding an assertion helps you find and fix it faster the second time
- Assertions should only be used to check the program's correctness, not the correctness of its inputs
- I.e., don't use
assert to check that your program managed to open a file - It isn't the program's fault if it doesn't have read permission…
- Complex assertions are more likely to contain bugs than to catch them
- If it won't fit comfortably on one line, it probably shouldn't be an assertion
- Comment on this slide
Running Other Programs
- One program can run another by spawning a new process
- Make and the shell both do this
- So can your programs
os.popen runs a command, and lets you either:
- Write to its standard input, or
- Read from its standard output
![[Running Another Program]](img/py04/os_popen.png)
Figure 10.5: Running Another Program |
- Example: read the output of the
ls command into your programimport os
instream = os.popen('ls', 'r')
lines = instream.readlines()
instream.close()
for line in lines[:4]:
line = line.rstrip()
print '%4d: %s' % (len(line), line)
13: addresses.txt
13: assertions.py
14: create_dict.py
18: create_dict.py.out
- Example: send the output of your program to
gzip for compression:
import os
# Create compressed output and report its size.
outstream = os.popen('gzip -c > temp.txt.gz', 'w')
for i in range(1000):
for word in 'this is a test'.split():
print >> outstream, word
outstream.close()
print 'output is %d bytes' % os.stat('temp.txt.gz').st_size
# See how big it would have been.
instream = os.popen('gunzip -c temp.txt.gz', 'r')
total = 0
for line in instream.readlines():
total += len(line)
instream.close()
print 'output would have been %d bytes' % total
output is 79 bytes
output would have been 15000 bytes
- You can use
os.popen2 to connect to the other program's input and output simultaneouslyimport os
childIn, childOut = os.popen2('sort')
for word in 'this is a test'.split():
print >> childIn, word
childIn.close()
for line in childOut.readlines():
print line.rstrip()
childOut.close()
a
is
test
this
- But you must be very careful about deadlock
- If the child is trying to write to the parent while the parent is trying to write to the child, both are blocked
- The operating system can only hold on to a limited amount of input or output data at a time, so increasing the buffer size only delays the problem
![[Deadlock]](img/py04/deadlock.png)
Figure 10.6: Deadlock |
- Things are even trickier if you use
os.popen3, which gives you the child process's stdin, stdout, and stderr
- If the parent is trying to read from the child's
stdout while it's trying to write an error message to stderr, both programs will be blocked
- Solutions are outside the scope of this course
- Use
os.popen (with temporary files if necessary) until you're comfortable enough to move on
- Comment on this slide
Exercises
Exercise 10.1:
Suppose you wanted to sort entries with the same frequency
alphabetically. What changes would you have to make to
compareByFrequency?
Debugging
What It Is
- Finding and fixing flaws in software
- Assume for now that you built the right thing the wrong way
- Requirements errors are actually a major cause of software project failure
- You're going to spend half your professional life debugging
- After all, once it's fixed, you move on to something else
- So learn how to do it well
- Talk about tools first (because they'll make everything else less painful)
- Then some simple rules
- Remember: never debug standing up
- The faster you try to work, the slower you'll go
- Comment on this slide
What's Wrong with Print Statements
- Many people still debug by adding print statements to their programs
- “Gee, I wonder what the value of
maxTemp is when I call resolve?”
- Debugging with print statements is:
- Error-prone: adding print statements is a good way to add typos
- Particularly when you have to modify the block structure of your program to fit it in
- Time-consuming
- All that typing…
- And (if you're using Java, C++, or Fortran) all that recompiling…
- Misleading
- Adding print statements moves things around in memory, changes execution timing, etc.
- Common to have bugs that hide when print statements are added, and reappear when they're removed
- There's a much better way
- Comment on this slide
Symbolic Debuggers
- A symbolic debugger is a program that runs another program on your behalf
- Called “symbolic” because it can show you the source code you wrote, rather than the raw machine code that's executing
- Using a debugger is one of the two things that separates professionals from amateurs
- Using version control is the other
- While the target program (or debuggee) is running, the debugger can:
- Pause, resume, or restart the target
- Inspect and display the values of variables
- Change those values
- Watch for certain events
- Do not need to modify the source of the target program!
- Depending on your language, you may need to compile it with different flags
- And yes, the debugger modifies the target's layout in memory, and execution speed…
- …but a lot less than print statements…
- …with a lot less effort from you
- Comment on this slide
Running in a Debugger
- Depending on language and tools, there may be several ways to get into the debugger
- Launch the debugger, load the target program, and start work
- Run the debugger with the target program as a command-line argument
- In interpreted languages like Python, you can also switch into debugging mode in the middle of an interactive session
- Sometimes also do post mortem debugging
- When a program fails badly (e.g., illegal memory access in C++), it creates a core dump
- Copies all of its internal state to a file on disk
- Can then load that dump into the debugger, and see where the program was when it died
- Not as good as watching it run…
- …but sometimes the best you can do
- Modern graphical debuggers typically show:
- The source code
- The call stack
- The values of variables that are currently in scope
- I.e., global variables, parameters to the current function call, and local variables in that function
- A panel displaying what your program has printed to standard output and/or standard error
![[A Debugger in Action]](img/debugging/debugger_in_action.png)
Figure 11.1: A Debugger in Action |
- If the debugger is part of a larger integrated development environment (IDE), it may show other things as well
- A source browser that presents an outline of the project's modules, classes, functions, variables, etc.
![[Source Browser]](img/debugging/source_browser.png)
Figure 11.2: Source Browser |
- A code assistant that presents context-sensitive help and documentation
![[Code Assistant]](img/debugging/code_assistant.png)
Figure 11.3: Code Assistant |
- Note: tools like this are available for every modern language
![[Microsoft Visual Studio in Action]](img/debugging/msvstudio.png)
Figure 11.4: Microsoft Visual Studio in Action |
- Comment on this slide
Basic Operations
- The debugger can set breakpoints in the target program
- Tells the target to pause when it reaches particular lines of code
- Once the program is paused, the debugger can single-step it
- Execute one statement at a time
- Most can:
- When the target is paused, the debugger can display the contents of its memory
- The compiler (or interpreter) keeps track of the types of values in memory…
- …so the debugger can show integers as integers, rather than as sequences of bytes
![[Inspecting Values]](img/debugging/inspecting_values.png)
Figure 11.5: Inspecting Values |
- Most debuggers can also evaluate expressions using the current values of variables
- E.g., type in
2*x<0, debugger displays False - Yes, this means that every debugger contains a mini-interpreter for the language in question…
- This is a lot easier than adding print statements…
- Comment on this slide
How Debuggers Work
- A running program is:
- A bunch of bytes in memory whose values are instructions
- Each statement in the source program typically corresponds to several instructions
- The compiler (or interpreter) can produce a table that records this mapping
- Another bunch of bytes whose values are the program's variables
- The static space contains constant strings, magic numbers, etc.
- The call stack is the parameters to all currently-active function calls
- The heap is all dynamically-created objects
- Some registers
- An instruction pointer that keeps track of what to execute next
- A stack pointer to keep track of the call stack
- Miscellaneous other registers for doing arithmetic, etc.
- In order to set a breakpoint on a particular line, a debugger:
- Figures out which instructions were produced for the statement on that line
- Records the first of those instructions in a safe place
- Replaces that instruction with a
HALT instruction- Every processor and virtual machine has one of these
![[Creating a Breakpoint]](img/debugging/setting_breakpoint.png)
Figure 11.6: Creating a Breakpoint |
- When the target program reaches the
HALT instruction:
- It stops, and sends a signal to the operating system…
- …which passes the signal on to the debugger…
- …which waits for instructions from the user
- Such as “Show me the value of
maxTemp”
- When the user wants the program to resume, the debugger:
- Puts the instruction at the breakpoint location back in place
- Tells the program to execute just that instruction
- Every processor and virtual machine has a “single step” mechanism
- Replaces the instruction with a
HALT once again- Otherwise, the breakpoint will disappear
- Tells the program to run normally
- Comment on this slide
Advanced Operations
- Debugger lets you move up and down the call stack
- Remember, the set of outstanding function calls is just a bunch of variables in memory
- Debugger can ask how large each one is, so it knows where to go to get to the previous stack frame
- Lets you figure out how execution got to a particular place
- Otherwise known as, “What?? We shouldn't be here!!”
- Can modify variables in a running program, too
- If you have a theory about why an intermittent bug is occurring:
- Run to that point
- Set variables' values (e.g., set
maxTemp to -1) - Resume execution
- Sometimes used to test error handling code
- Easier to change
timeSpentWaiting to 600 seconds in debugger than to pull out the network cable and wait…
- Can set conditional breakpoints
- Only pauses program if some condition is met
- Loop index greater than 100
- Filename argument to function is
None
- Some debuggers also support watchpoints
- Have the debugger watch every write to memory
- Halt when anything, anywhere, modifies a particular variable
- Slow the program down a lot
- Typically a factor of 100 or more
- But sometimes the only practical way to find out why a string in C is being overwritten
- Comment on this slide
Rule 0: Get It Right the First Time
- The simplest bugs to fix are the ones that don't exist
- Design, reflect, discuss, then code
- A week of hard work can sometimes save you an hour of thought
- Make good style a habit
- Train yourself to do things right, so that you'll code well even when you're tired, stressed, and facing a deadline
- Comment on this slide
Rule 1: What Is It Supposed to Do?
- First step is knowing what the problem is
- “It doesn't work” isn't good enough
- What exactly is going wrong?
- How do you know?
- Requires you to know how the software is supposed to behave
- Is this case explicit in the specification?
- If not:
- Do you have enough knowledge to extrapolate?
- Do you have the right to do so?
- Write it down!
- Preferably away from the keyboard and screen
- Try not to let what you're seeing influence what you expect to see
- Cognitive dissonance leads us to adjust what we think we're observing in order to fit preconceptions
- Question: What do testers do while waiting for code?
- Answer: read the specification and start writing down test cases
- If there's no specification, you're wasting valuable time
- And if QA isn't involved in drawing up the specification, to guarantee testability, you're wasting even more
- Comment on this slide
Rule 2: Is It Plugged In?
- Are you actually exercising the problem that you think you are?
- Are you giving it the right test data?
- Is it configured the way you think it is?
- Is it the version you think it is?
- Has the feature actually been implemented yet?
- Why are you sure?
- Maybe the reason you can't isolate the problem is that it's not there
- Another argument in favor of automatic regression tests
- Guaranteed to rerun the test the same way each time
- Also a good argument against automatic regression tests
- If the test is wrong, it will generate the same false negative each time
- Comment on this slide
Rule 3: Make It Fail
- You can only debug things when you can see them going wrong
- Find a test case that makes the code fail
- Then try to simplify it
- Then try to find another one
- Use data to weed out false hypotheses
- Remember, it's computer science, not computer flip-a-coin
- If you can't make it fail repeatedly, you're going to have to rely on:
- Post-mortem inspection (autopsies)
- But then you have to reason backwards to figure out why the program crashed
- Logging with print statements
- But this can distort the program's behavior
- Comment on this slide
Rule 4: Divide and Conquer
- Once you have a test that makes the system fail, isolate the faulty subsystem
- Faults can only occur upstream from where the system fails
- Look at input to the module that's failing
- If that's wrong, look at preceding module's input, and so on
- Fail early, fail often
- Use
assert to trap consistency errors early - Every time you fix a bug, see whether you can add an assertion
- If you made the mistake once, odds are that you, or someone, will make it again
- Another strong argument against duplicated code
- Comment on this slide
Rule 5: Change One Thing at a Time, For a Reason
- Replacing random chunks of code unlikely to do much good
- If you got it wrong the first time, what makes you think you'll get it right the second? Or the ninth?
- So always have a hypothesis before making a change
- Change one thing, then re-run your tests
- Changes can often uncover (or introduce) new bugs
- The more things you change at once, the harder it is to know what's responsible for what
- And the harder it is to keep track of what you've done, and what effect it had
- Comment on this slide
Rule 6: Write It Down
- Science works because scientists keep records
- If you try to keep the state of play in your head, you will lose track (and time)
- “Wait, didn't A followed by B with an odd number of lines cause those symptoms? Or was it B followed by A? Or was it A and B with an even number of lines?”
- Records particularly useful when getting help
- People are more likely to listen when you can explain clearly what you did
- Comment on this slide
Rule 7: Be Humble
- If you can't find it in 15 minutes, ask for help
- Just explaining the problem aloud is often enough
- Don't keep telling yourself why it should work: if it doesn't, it doesn't
- Never debug while grinding your teeth, either…
- Keep track of your mistakes
- Just as runners keep track of their time for the 100 meter sprint
- You're most likely to fix what you're paying attention to
- Comment on this slide
Summary
- How productive you are depends on:
- What tools you use
- How well you use them
- A symbolic debugger will make you more productive
- Like any other tool, there's a learning curve
- But after a few sessions, you will have more working code in less time
- The best debugger in the world is no substitute for working methodically
- The scientific method is the best tool we've ever invented for finding out “why”
- Comment on this slide
Object-Oriented Programming
Motivation
- How should a programmer represent XY points?
- Using a list of two numbers is convenient, but:
- People reading the code have to remember (or guess) that
p[1] means “the Y coordinate” - There's nothing to stop someone accidentally appending a third “coordinate”, or setting the X coordinate to
"orange"
- (Almost) all modern languages encourage programmers to define abstract data types
- Called abstract because they hide the details of their implementation
- Programmers interact with them through a limited set of operations, rather than by manipulating data directly
- Fewer things can go wrong
- Easier to read resulting code
- Makes code easier to maintain, since internals can be changed without changing calling code
- An ADT is usually created by defining a class, then creating one or more objects that are instances of that class
- All objects have the same methods (abilities), but their own state
- Changes to one object do not affect the state of others
- Comment on this slide
A Naked Class
- Define a class in Python using the
class keyword- Like other keywords,
class is followed by ":", and introduces block - Name of the new class is usually followed by
object in parentheses- We'll see other options later
- Create a new instance of the class by “calling” the class name as if it were a function
- If the class's body is
pass, this creates a blank object - See in a moment how to make objects more interesting
- Most objects have data members that store their internal state
- A “personal” variable, like “my nose” or “my watch”
- Use the expression
obj.data to access the data member of obj
- Unlike C++ and Java, Python has no notion of “hiding” data members
- Create data members simply by referring to them
- The expression
obj.x = 3 either:
- Gives
obj a new member x, and assigns it the value 3, or - Overwrites the existing member
x with the value 3
- Just like creating variables in a function
- Example: an XY point
class point2d(object):
pass
if __name__ == '__main__':
p = point2d()
p.x = 0.0
p.y = 3.0
print "point is", p
print "coordiantes are", p.x, p.y
- Comment on this slide
Methods
- Naked classes are just glorified dictionaries
- Instead of
obj['x'], have obj.x
- Really become interesting when we start to define methods
- Any function defined inside a class belongs to that class
- When a method is called, Python passes the object itself as the method's first argument
- Which means that every method must take at least one argument
- By convention, it is called
self
- Why use methods?
- Make code easier to understand by grouping operations
- Give programmers a place to check the integrity of their ADTs
- The expression
obj.meth(arg) is equivalent to:
- Find the class that
obj is an instance of - Call that class's
meth method with obj and arg as arguments
- When a new object is created, Python checks whether the class has defined a method called
__init__
- If it has, Python calls it before doing anything else
- This constructor can give the object its initial data members simply by assigning them values
- Inside a method, refer to the object's
data member as self.data
- Since the object was passed to the method as the parameter
self - The expression
data is not the same as self.data
data on its own means “a local variable called data”
- Example: a 2D point in the upper right quadrant
class point2d(object):
# Constructor defines how to create new objects.
def __init__(self):
self.x = 0
self.y = 0
# Get X coordinate.
def getX(self):
return self.x
# Set X coordinate.
def setX(self, newX):
assert newX >= 0
self.x = newX
# Get and set Y coordinate.
def getY(self):
return self.y
def setY(self, newY):
assert newY >= 0
self.y = newY
# Calculate norm.
def norm(self):
return math.sqrt(self.x ** 2 + self.y ** 2)
- The
__init__ constructor sets the point's initial coordinates - The getter methods replace raw read access
- The setter methods check the constraints on the state of the class
- If someone tries to assign a negative coordinate, the class will raise an exception
- Fail early, fail often
- Note: when testing the class, test the error handling too
# Tests.
if __name__ == '__main__':
p = point2d()
assert p.getX() == 0.0
assert p.getY() == 0.0
p.setX(3.0)
p.setY(4.0)
assert p.getX() == 3.0
assert p.getY() == 4.0
assert p.norm() == 5.0
try:
p.setX(-1.0)
assert False
except AssertionError:
pass
- Keep the memory model in mind when thinking about objects
![[Memory Model]](img/oop/memory_model.png)
Figure 12.1: Memory Model |
- A method is just a function that Python can call in a particular way
- You can call it directly if you want to
p = point2d()
p.setX(9.0)
p.setY(2.0)
print "calling getX directly gives", point2d.getX(p)
calling getX directly gives 9.0
- Comment on this slide
Defining a Queue
- A queue can do three things:
- Enqueue an item
- Dequeue the front item
- Which raises an exception if the queue is empty
- Remember, error behavior is part of a class's interface too
- Report whether it is empty or not
- Which is less powerful than reporting its size
- A queue may store items in many different ways
- The simplest in Python is probably to use a list…
- …but it isn't the only one…
- …and whoever is using the class shouldn't care
- At least, not unless speed and/or memory requirements become an issue
- Implementation:
class Queue(object):
'''Implement a queue.'''
def __init__(self):
'''Construct an empty queue.'''
self.data = []
def enq(self, val):
'''Add a new value to the end of the queue.'''
self.data.append(val)
def deq(self):
'''Return the front value of the queue, or raise ValueError if
the queue is empty.'''
if len(self.data) == 0:
raise ValueError, "Can't dequeue from empty queue"
val = self.data[0]
self.data = self.data[1:]
return val
def empty(self):
'''Report whether the queue is empty or not.'''
return len(self.data) == 0
- Use docstrings instead of comments to document classes and methods (and functions)
- Unlike comments, they are kept at runtime to provide interactive help
print Queue.__doc__
print Queue.enq.__doc__
print q.deq.__doc__
Implement a queue.
Add a new value to the end of the queue.
Return the front value of the queue, or raise ValueError if
the queue is empty.
raise ValueError instead of using assert when someone tries to dequeue from an empty queue- The operation isn't a flaw in this code (which is what assertions are for)
- Question: should
point2d.setX and point2d.setY have raised ValueError as well?
- Always include tests
if __name__ == '__main__':
q = Queue()
assert q.empty()
try:
x = q.deq()
assert False
except ValueError:
pass
q.enq('first')
assert not q.empty()
q.enq('second')
assert q.deq() == 'first'
assert q.deq() == 'second'
assert q.empty()
- We'll see in the next two lectures how to do this systematically
- Comment on this slide
Special Methods
__init__ is just one example of a special method
- All have names beginning and ending with double underscore
- Give programmers a way to make their data types look like those built into Python
- Most widely used is probably
__str__
- When Python needs a string representation of an object, it sees if the object's class defines
__str__ - If it does, Python calls it, and uses the string it returns
- If it doesn't, Python creates a default string representation
class first(object):
def __init__(self, name):
self.name = name
class second(object):
def __init__(self, name):
self.name = name
def __str__(self):
return 'second/"%s"' % self.name
f = first("orange")
print "first object is", f
s = second("apple")
print "second object is", s
first object is <__main__.first object at 0x401d982c>
second object is second/"apple"
- Other examples:
- If
obj is an object, len(obj) calls obj.__len__()
- Easy to add to
Queue to report how long the queue is
x in obj is equivalent to obj.__contains__(x)
- Would also be easy to add to
Queue, but should we? - Don't want to create “kitchen sink” classes
class Queue(object):
def __init__(self):
self.data = []
def enq(self, val):
self.data.append(val)
def deq(self):
if len(self.data) == 0:
raise ValueError, "Can't dequeue from empty queue"
val = self.data[0]
self.data = self.data[1:]
return val
def __len__(self):
return len(self.data)
def empty(self):
return self.__len__() == 0
if __name__ == '__main__':
q = Queue()
q.enq('first')
q.enq('second')
assert len(q) == 2
q.deq()
q.deq()
assert q.empty()
- Comment on this slide
Inheritance
- What if we want two types of queue, one with
__len__ and one without? - More realistically, suppose we have two types of people: professors and students
- They have a lot in common: names, email addresses, etc.
- But each has things the other doesn't (e.g., an office, or courses taught vs. courses taken)
- Really don't want to write each method twice
- Anything repeated in two or more places will eventually be wrong in at least one.
- The object-oriented solution is inheritance
- One class extends its parent class by adding new state and methods
- Anything instances of the parent class can do, instances of the child class can do as well
- But not vice versa
- Extend any class with any number of child classes
- All children share the behavior of their parent
- But are completely independent of one another
- For example, define a
Person as:
class Person(object):
'''Represent generic traits of any person.'''
def __init__(self, name, email):
'''Create a new person with a name and an email address.'''
self.name = name
self.email = email
def getName(self):
return self.name
def getEmail(self):
return self.email
- Then derive
Professor:
class Professor(Person):
'''Professors are more than people...'''
def __init__(self, name, email, office):
Person.__init__(self, name, email)
self.office = office
def getOffice(self):
return self.office
- Note that it's
Professor(Person) instead of Professor(object)
- Signals that
Professor extends Person instead of the generic built-in object class
- Note also that
Professor's constructor calls Person's- Ensures that all of
Person's initialization is done without duplicating code - Ensures that changes to
Person will automatically be reflected in Person
- Now define
Student:
class Student(Person):
'''Students are more than people too...'''
def __init__(self, name, email, program):
Person.__init__(self, name, email)
self.program = program
def getProgram(self):
return self.program
- And run a few tests:
if __name__ == '__main__':
tb = Professor('Tycho Brahe', 't.brahe@copenhagen.dk', 'Room 241')
print tb.getName(), tb.getEmail(), tb.getOffice()
jk = Student('Johannes Kepler', 'j.kepler@uraniborg.org', 'B.A.')
print jk.getName(), jk.getEmail(), jk.getProgram()
Tycho Brahe t.brahe@copenhagen.dk Room 241
Johannes Kepler j.kepler@uraniborg.org B.A.
- Comment on this slide
Polymorphism
- Suppose we define
Person.__str__, then override it with more specific definitions in Professor and Student
- When Python looks for a method, it checks the object's class, then its parent class, and so on
![[Method Lookup]](img/oop/polymorphism.png)
Figure 12.2: Method Lookup |
- Result is that Python calls the most appropriate method for an object
class Person(object):
def __init__(self, name, email):
self.name = name
self.email = email
...
def __str__(self):
return '%s (%s)' % (self.name, self.email)
class Professor(Person):
def __init__(self, name, email, office):
Person.__init__(self, name, email)
self.office = office
...
def __str__(self):
result = Person.__str__(self)
result += ' Room %s' % self.office
return result
class Student(Person):
def __init__(self, name, email, program):
Person.__init__(self, name, email)
self.program = program
...
def __str__(self):
result = Person.__str__(self)
result += ' Program %s' % self.program
return result
if __name__ == '__main__':
crowd = [
Professor('Tycho Brahe', 't.brahe@copenhagen.dk', 'Room 241'),
Student('Johannes Kepler', 'j.kepler@uraniborg.org', 'B.A.')
]
for person in crowd:
print person
Tycho Brahe (t.brahe@copenhagen.dk) Room Room 241
Johannes Kepler (j.kepler@uraniborg.org) Program B.A.
- Polymorphism means “having more than one form”
- In object-oriented programming, it means being able to treat different types of things as interchangeable
- E.g., if every class derived from
Shape defines a method area, then obj.area() will return the area of obj without the caller having to know if obj is actually a square or a circle
- Comment on this slide
The Substitution Principle
- Python looks up each method when it's called
- So any two classes that define the same set of methods can be used interchangeably
- However, the usual way to implement polymorphism is via inheritance
- Only override things that need to change
- The Liskov Substitution Principle states that it should always be possible to replace an instance of a parent class with an instance of the child class
- Means that if
Parent.meth returns a string, Child.meth must also return a string - If
Parent.meth returns a non-empty string with at least three vowels, Child.meth must as well
- Often expressed in terms of pre-conditions and post-conditions
Child.meth must satisfy all the post-conditions of Parent.meth, and may impose more- In other words,
Child.meth's possible output is a subset of Parent.meth's - So any code that works correctly using the output of
Parent.meth will continue to work if given an instance of Child instead
Child.meth may ignore some of Parent.meth's pre-conditions, but may not impose more- Equivalently,
Child.meth accepts everything thatParent.meth did, and possibly more - So any code that could call
Parent.meth correctly is guaranteed to call Child.meth correctly too
- Comment on this slide
Class Members
- Sometimes want to share data between all instances of a class
- Constants, a count of the number of class instances created, etc.
- Any data members defined inside the
class block belong to the class as a wholeclass Counter(object):
num = 0 # Number of Counter objects created.
def __init__(self, name):
Counter.num += 1
self.name = name
if __name__ == '__main__':
print 'initial count', Counter.num
first = Counter('first')
print 'after creating first object', Counter.num
second = Counter('second')
print 'after creating second object', Counter.num
initial count 0
after creating first object 1
after creating second object 2
- Can also define methods that belong to the class, rather than to particular objects
- Out of scope for this course, but often useful
- Comment on this slide
Overloading Operators
- (Almost) every feature of Python objects can be overridden by defining the right special method
- If done carefully, makes code much easier to read
- If done haphazardly, makes it almost impossible to understand
- One common use is operator overloading
- The symbol
"+" is just a shorthand for a function that takes two arguments - If a class defines a method
__add__, it takes precedence over the built-in “function”
- So
x + y calls x.__add__(y)
- Of course, it's not quite that simple
2 + x and x + 2 don't always do the same thing- String concatenation
- Matrix multiplication
- Classes can define right-hand versions of operators, e.g.,
__radd__ instead of __add__
- If
x has an __add__ method, call that - Otherwise, if
y has an __radd__ method, call that - Otherwise, try Python's built-ins
- Example: a vector is sparse if most of its entries are zero
- No point padding eleven actual values with nine million zeroes
- So use a dictionary to record non-zero values and their indices
- Addition: create a new vector with a non-zero value wherever either operand had a non-zero value
- Dot product: add up products of matching non-zero values
- Length: return one more than the index of the largest non-zero value
- “One more” to be consistent with Python's lists
- Hm…what is the length of
v after:
v = SparseVector() # all values initialized to 0.0
v[27] = 1.0 # length is now 28
v[43] = 1.0 # length is now 44
v[43] = 0.0 # is the length still 44, or 28?
- This isn't a programming question; it's a data modeling question
- What does the data in the program have to do to model the real world accurately?
- The easy stuff
- Construction creates an empty sparse vector
- Its length is the largest index ever seen
- May or may not be sensible, but it's easy to implement
class SparseVector(object):
'''Implement a sparse vector. If a value has not been set
explicitly, its value is zero.'''
def __init__(self):
'''Construct a sparse vector with all zero entries.'''
self.values = {}
def __len__(self):
'''The length of a vector length is the largest index ever seen.'''
if len(self.values):
return max(self.values.keys())
return 0
- Making it look like a vector
- Subscripting with
[] is just a binary operator tooobj[index] is implemented as obj.__getitem__(index)obj[index] = value is obj.__setitem(index, value)- Note: all programming languages do this—the only question is whether they let you see it
def __getitem__(self, key):
'''Return an explicit value, or 0.0 if none has been set.'''
if key in self.values:
return self.values[key]
return 0.0
def __setitem__(self, key, value):
'''Assign a new value to a vector entry.'''
self.values[key] = value
- Dot product
- The other object (on the right side of
"*") is usually called other - No reason to insist that it be a sparse vector
- Could equally well be a list of values
- So loop over the indices of our own values, multiply by corresponding values in other object
- Any index not encountered in this loop doesn't matter, since it corresponds to something that's zero
def __mul__(self, other):
'''Calculate cross product of a sparse vector with something else.'''
result = 0.0
for k in self.values.keys():
result += self.values[k] * other[k]
return result
def __rmul__(self, other):
return self.__mul__(other)
- Note:
__rmul__ = __mul__ makes __rmul__ point to the same code as __mul__
- There's no magic here: methods are functions, are objects, are things you can assign to names
- Addition
- Trickier than multiplication: result is non-zero wherever either argument is non-zero
- Don't want to loop over all the zeroes of either argument
- Solution: if the other object is a sparse vector, cheat
- I.e., reach inside it, and rely on details of its implementation
def __add__(self, other):
'''Add something to a sparse vector.'''
# Initialize result with all non-zero values from this vector.
result = SparseVector()
result.values.update(self.values)
# If the other object is also a sparse vector, add non-zero values.
if isinstance(other, SparseVector):
for k in other.values.keys():
result[k] = result[k] + other[k]
# Otherwise, use brute force.
else:
for i in range(len(other)):
result[i] = result[i] + other[i]
return result
# Right-hand add does the same thing as left-hand add.
__radd__ = __add__
- And finally, some tests
if __name__ == '__main__':
x = SparseVector()
x[1] = 1.0
x[3] = 3.0
x[5] = 5.0
print 'len(x)', len(x)
for i in range(len(x)):
print '...', i, x[i]
y = SparseVector()
y[1] = 10.0
y[2] = 20.0
y[3] = 30.0
print 'x + y', x + y
print 'y + x', y + x
print 'x * y', x * y
print 'y * x', y * x
z = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print 'x + z', x + z
print 'x * z', x * z
print 'z + x', z + x
- Comment on this slide
Structured Unit Testing
A Unit Testing Framework
- Unit testing follows a pattern
- Setup and teardown
- Lots of small, independent tests
- Reporting
- Aggregation (combine tests into a test suite, and test suites into larger suites)
- See a pattern, build a framework
- Write shared code once
- Encourage people to work a certain way
- I.e., make it easy for them to do things right
- JUnit is a testing framework originally written by Kent Beck and Erich Gamma in 1997
- Widely adopted
- Now integrated into almost all Java IDEs
- Made testing easy enough that programmers actually started doing it
- Widely imitated:
- Workalikes are available C++, Perl, .NET, etc.
- Once you know one, you can easily learn and use the others
- Widely extended
- Add-ons for measuring test execution times, recording tests, testing web applications, etc.
- Comment on this slide
Mechanics
- Python's standard unit testing framework is the
unittest module - What you do:
import unittest- Derive a class (or several classes) from
unittest.TestCase - Define one method for each test
- Name must begin with
"test" - Takes no argument (other than
self) - Returns nothing
- Call
unittest.main()
- What the framework does:
- Search the program to find classes derived from
TestCase
- See how this works in a later lecture
- Search each class to find methods whose names begin with the word
"test" - For each method:
- Create a new instance of the class
- Call the method
- Record pass, fail, or error
- Actually check things inside test methods using methods provided by
TestCase
- Allows the framework to distinguish between test assertions, and normal
assert statements- Since the code being tested might use the latter
- Checking methods include:
assert_(condition): check that something is true (note the underscore)assertEqual(a, b): check that two things are equalassertNotEqual(a, b): the reverse of the aboveassertRaises(exception, func, …args…): call func with arguments (if provided), and check that it raises the right exceptionfail(): signal an unconditional failure
- Comment on this slide
Testing a Function
- You want to test a function that calculates a running sum of the values in the list
- Replaces
[a, b, c, …] with [a, a+b, a+b+c, …] - Test cases:
- Empty list
- Single value
- Long list with mix of positive and negative values
- Hm… Is it supposed to:
- Return a new list?
- Modify its argument in place and return that?
- Modify its argument and return
None?
- Your tests can only ever be as good as (your understanding of) the spec
- We'll assume it's supposed to return a new list
- First implementation of
runningSum:
import sys, unittest
def runningSum(seq):
result = seq[0:1]
for i in range(2, len(seq)):
result.append(result[i-1] + seq[i])
return result
class SumTests(unittest.TestCase):
def testEmpty(self):
self.assertEqual(runningSum([]), [])
def testSingle(self):
self.assertEqual(runningSum([3]), [3])
def testDouble(self):
self.assertEqual(runningSum([2, 9]), [2, 11])
def testLong(self):
self.assertEqual(runningSum([-3, 0, 3, -2, 5]), [-3, -3, 0, -2, 3])
if __name__ == '__main__':
unittest.main()
F.E.
======================================================================
ERROR: testLong (__main__.SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "running_sum_wrong.py", line 21, in testLong
self.assertEqual(runningSum([-3, 0, 3, -2, 5]), [-3, -3, 0, -2, 3])
File "running_sum_wrong.py", line 6, in runningSum
result.append(result[i-1] + seq[i])
IndexError: list index out of range
======================================================================
FAIL: testDouble (__main__.SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "running_sum_wrong.py", line 18, in testDouble
self.assertEqual(runningSum([2, 9]), [2, 11])
File "c:\Python23\lib\unittest.py", line 302, in failUnlessEqual
raise self.failureException, \
AssertionError: [2] != [2, 11]
----------------------------------------------------------------------
Ran 4 tests in 0.000s
FAILED (failures=1, errors=1)
- Notice that test methods deserve to have meaningful names too
- Can you spot the error?
- Default output gives the stack trace, to help you find the problem
- You can also provide messages to the
TestCase.assert family of methods
- Tests are run in an arbitrary order
- Classes store methods in a dictionary for fast lookup
- Means that there's no guarantee they'll be run in the order in which you defined them
- Yet another good reason to make each test independent
- Fix the method and re-run the tests
def runningSum(seq):
result = seq[0:1]
for i in range(1, len(seq)):
result.append(result[i-1] + seq[i])
return result
....
----------------------------------------------------------------------
Ran 4 tests in 0.000s
OK
- Everything works
- Should you really go to this much effort to test a simple function?
- Took less than a minute to write the four tests
- Uncovered one gap in the requirements, and one error in the first implementation
- Able to verify the fix almost instantly
- Sounds pretty good to me…
- Comment on this slide
Eliminating Redundancy
- Setting up a fixture can often be more work than writing the test
- Complex data structures, external files, etc.
- If the test class defines a method called
setUp, the framework will call it before running each test- And if there's a method called
tearDown, it will be run after each test
- Example: test whether a function can trace Make-like dependencies correctly
- Several kinds of tests to run, and several fixtures to run them on
- So create all the fixtures in
setUp, and use them in the test methods- Alternatively, create a separate test class for each fixture
- Worth doing if creating a fixture takes so long that you don't want any redundant setup
class TestDepends(unittest.TestCase):
def setUp(self):
self.f0 = {}
self.f1 = {'a' : []}
self.f2 = {'a' : ['a']}
self.f3 = {'a' : ['b'], 'b' : []}
self.f4 = {'a' : ['b', 'c'], 'b' : ['a'], 'c' : ['d'], 'd' : []}
def testCanReachSelf(self):
self.assert_(dependsOn(self.f1, 'a', 'a'))
self.assert_(dependsOn(self.f2, 'a', 'a'))
self.assert_(dependsOn(self.f3, 'b', 'b'))
self.assert_(dependsOn(self.f4, 'c', 'c'))
def testSingleStep(self):
self.assert_(dependsOn(self.f3, 'a', 'b'))
self.assert_(dependsOn(self.f4, 'a', 'b'))
self.assert_(dependsOn(self.f4, 'c', 'd'))
def testBackward(self):
self.assert_(dependsOn(self.f4, 'b', 'a'))
def testMultiStep(self):
self.assert_(dependsOn(self.f4, 'a', 'd'))
self.assert_(dependsOn(self.f4, 'b', 'd'))
def testNegative(self):
self.assert_(not dependsOn(self.f1, 'a', 'c'))
self.assert_(not dependsOn(self.f3, 'b', 'a'))
- Comment on this slide
Testing for Failure
- Testing that code fails in the right way is just as important as testing that it does the right thing
- Otherwise, someone will do something wrong some day, and the code won't report it
- Two ways to do it:
- Use
TestCase.assertRaises to check that a specific function raises a specific exception - Use
try/except yourself- Have to do it this way in C++, Java, and most other sturdy languages
- Example: testing a function that finds all values inside a double-ended range
- Raises
ValueError if the range is empty, or if the set of values is empty class TestInRange(unittest.TestCase):
def testNoValues(self):
try:
inRange([], 0.0, 1.0)
except ValueError:
pass
else:
self.fail()
def testBadRange(self):
try:
inRange([0.0], 3.0, 3.0)
inRange([0.0], 4.0, -2.0)
except ValueError:
pass
else:
self.fail()
- Comment on this slide
Testing I/O
- Input and output often seem hard to test
- Put a bunch of fixture files and expected results in a testing directory?
- Create temporary files when and as tests execute?
- The best answer is to use fake I/O using strings
- Python's
StringIO and cStringIO modules can read and write strings instead of files - Similar packages exist for C++, Java, and other languages
- This only works if the function being tested takes streams as arguments, rather than filenames
- If the function opens and closes the file, no way for you to substitute a fake file
- An example of designing code to make it testable
- Example: find lines where two files differ
- Nothing as fancy as
diff, just a line-by-line comparison - Input: two streams (which might be open files or
StringIO wrappers around strings) - Output: another stream (i.e., a file, or a
StringIO) - As a side effect, we've made the function itself more useful
- People can now use it to compare strings, or strings and files
class TestDiff(unittest.TestCase):
def wrapAndRun(self, left, right, expected):
left = StringIO(left)
right = StringIO(right)
actual = StringIO()
diff(left, right, actual)
self.assertEqual(actual.getvalue(), expected)
def testEmpty(self):
self.wrapAndRun('', '', '')
def testLengthyMatch(self):
str = 'a\nb\nc\n'
self.wrapAndRun(str, str, '')
def testSingleLineMismatch(self):
self.wrapAndRun('a\n', 'b\n', '1\n')
def testMiddleMismatch(self):
self.wrapAndRun('a\nb\nc\n', 'a\nx\nc\n', '2\n')
- Comment on this slide
Testing With Classes
- Testing classes isn't much different from what we've already seen
- Create objects, call their methods, check the results
- But how can we keep tests independent?
- Can't test methods unless we can construct objects…
- …but the only way to see if objects have been constructed properly is to invoke their methods
- The answer is to keep tests as independent as possible
- The first tests you write use simple constructors, and simple inspector methods
- Once those pass, start writing more complex tests
- They probably won't be executed in this order once they're all written…
- …but they still give you somewhere to set a breakpoint when simple things start to go wrong
- Much more interesting to think about how to use object-oriented ideas in constructing tests
- Create one class that has a lot of tests
- Derive other test classes that:
- Create specific fixtures
- Run specific tests
- By default, the framework will run all the generic tests for the specific classes as well
![[Object-Oriented Testing]](img/unit/oo_tests.png)
Figure 13.1: Object-Oriented Testing |
- This only works if the tests in the base class are generalized
- A little more setup work for a big payoff
- Example: billing application for a software consulting firm
- Each consultant may be assigned to zero or more projects
- Each project may have zero or more consultant currently assigned to it
- The
Assignment class keep track of who is currently assigned to what- Use strings for consultant and project IDs for now
- Create a base class
CommonTests to hold shared tests- Do not derive this from
unittest.TestCase - If we did, the framework would automatically try to run it…
- …and it's not designed to be run on its own
- This class does not create a fixture
- That's the job of the derived class
- Add a test
- Each test records the state of the fixture…
- …does something…
- …and then checks that the fixture is in the right final state
- Remember, the fixture is recreated from scratch for each test
class CommonTests(object):
'''Tests that can be applied to all fixtures.'''
def testAddNew(self):
'''Check that adding a single new assignment works.'''
priorConsultants = self.fixture.getConsultants()
priorProjects = self.fixture.getProjects()
self.fixture.addAssignment('Alan', 'tables')
self.assertEqual(self.fixture.getByConsultant('Alan'),
Set(['tables']))
self.assertEqual(self.fixture.getByProject('tables'),
Set(['Alan']))
self.assertEqual(self.fixture.getConsultants(),
Set(['Alan']) | priorConsultants)
self.assertEqual(self.fixture.getProjects(),
Set(['tables']) | priorProjects)
- Derive another class from both
CommonTests and unittest.TestCase
- The former ensures that it inherits all the common test cases
- The latter ensures that the framework will find it, and know how to run it
- This class creates a fixture, and adds a few fixture-specific tests
class TestEmpty(unittest.TestCase, CommonTests):
'''Test an empty assignment table.'''
def setUp(self):
'''Create the empty assignment table.'''
self.fixture = Assignment()
def testNoConsultants(self):
'''Make sure that nonexistent consultants aren't present.'''
self.assertEqual(self.fixture.getByConsultant('nobody'), Set())
def testNoProjects(self):
'''Make sure that nonexistent projects aren't present.'''
self.assertEqual(self.fixture.getByProject('nothing'), Set())
- Run it
- Framework finds
TestEmpty - Goes through the test cycle (test, setup, run, teardown) for each
testXYZ method in it, and in CommonTests
- Now add another shared test to
CommonTests:
def testAddDelNew(self):
'''Check that adding and then removing an assignment
leaves things as they were.'''
priorConsultants = self.fixture.getConsultants()
priorProjects = self.fixture.getProjects()
self.fixture.addAssignment('Alan', 'tables')
self.fixture.delAssignment('Alan', 'tables')
self.assertEqual(self.fixture.getConsultants(), priorConsultants)
self.assertEqual(self.fixture.getProjects(), priorProjects)
- And/or derive another test class:
class TestMulti(unittest.TestCase, CommonTests):
'''Set up common tests with multiple assignments already present.'''
def setUp(self):
self.fixture = Assignment()
self.fixture.addAssignment('Bhargan', 'dishes')
self.fixture.addAssignment('Harald', 'dishes')
self.fixture.addAssignment('Harald', 'floors')
self.fixture.addAssignment('Rachel', 'floors')
self.fixture.addAssignment('Rachel', 'counters')
self.fixture.addAssignment('Sally', 'counters')
- Multiplicative effect: (common tests × derived classes) + specific tests
- Comment on this slide
Test-Driven Development
- With test-driven development (TDD), you write your tests, then write the code
- A key element of agile development methodologies like eXtreme Programming…
- …but it was around long before they were…
- …and pays off in almost every context
- Seems backward, but has very real advantages:
- Gives programmers a definite goal
- Coding is finished when all tests run
- Particularly useful when trying to fix bugs in old code, as it forces you to figure out how to recreate the bug
- Helps prevent the “one more feature” syndrome
- Ensures that tests actually get written
- People are often too tired, or too rushed, to test after coding
- Helps clarify the API before it is set in stone
- If something is awkward to test, it can be redesigned before it's written
- A great way to communicate requirements
- I write the tests
- “All” you have to do is write code that passes those tests
- Example: I want you to write a function that:
- A single rectangle (represented as an instance of class
Rect) as its first argument - And a list of rectangles (each also an instance of
Rect) as its second argument - And returns
True if the rectangles in the list completely cover the first rectangle, or False if they don't - Best way for me to specify everything I want is to write some test cases
class TestOverlay(unittest.TestCase):
def testNoRects(self):
self.assert_(not overlay(Rect(0, 0, 1, 1), []))
def testWithSelf(self):
r = Rect(0, 0, 1, 1)
self.assert_(overlay(r, [r]))
def testHalfOnly(self):
self.assert_(not overlay(Rect(0, 0, 2, 2),
[Rect(0, 0, 1, 2)]))
def testTwoHalves(self):
self.assert_(overlay(Rect(0, 0, 2, 2),
[Rect(0, 0, 1, 2), Rect(1, 0, 2, 2)]))
- Comment on this slide
Exercises
Exercise 13.1:
Python has another unit testing module called doctest.
It searches files for sections of text that look like interactive
Python sessions, then re-executes those sections and checks the
results. A typical use is shown below.
def ave(values):
'''Calculate an average value, or 0.0 if 'values' is empty.
>>> ave([])
0.0
>>> ave([3])
3.0
>>> ave([15, -1.0])
7.0
'''
sum = 0.0
for v in values:
sum += v
return sum / float(max(1, len(values)))
if __name__ == '__main__':
import doctest
doctest.testmod()
Convert a handful of the tests you have written for other
questions in this lecture to use doctest. Do you prefer it
to unittest? Why or why not? Do you think doctest
makes it easier to test small problems? Large ones? Would it be
possible to write something similar for C, Java, Fortran, or
Mathematica?
Automated Builds
How Do You Rebuild A Program?
- Most programming languages require you to compile programs before running them
- E.g., turn C source files into machine-specific object files
- Which usually have a
.o extension
- It's easy enough to create
overlap.o from overlap.c: just run cc overlap.c - But what if you have many source files?
cc *.c is slow and inefficient if only one of your 117 source files has changed
- Worse, what if some files rely on others?
- Suppose the code in
overlap.c uses code in rectangle.c - If you change both, but only recompile
overlap.c, you'll be using out-of-date rectangles
- Comment on this slide
Automate, Automate, Automate
- Remember rule #2: anything worth repeating is worth automating
- Computers are good at details, and don't get bored, so have them do repetitive tasks
- We need:
- A way to describe the tasks (what things to do)
- A way to specify the dependencies between them (when to do things)
- Most widely used tool for this is called Make
- Invented in 1975 by Stuart Feldman at Bell Labs
[Feldman 1979]
- The good news is that Make will:
- Figure out what has changed
- Work out what is affected by those changes
- Execute commands to bring things up to date (e.g., by recompiling)
- Commands are handed to the shell for execution, so they're the same ones you would run yourself
- The bad news is Make's syntax
- Initially very simple, but has grown more complex over time
- Has turned into a little programming language in its own right (rule #11)
- We will ignore the complex bits for now, and take a look at a better way to solve these problems in Backward, Forward, and Sideways
- Comment on this slide
Our Example
- Make doesn't care what language you use
- In fact, it doesn't even care if you're using it to recompile things
- Often used to run tests, keep web sites up to date, etc.
- Any task where one action depends on others is a good candidate
- Running example: Sakura Synthesis is studying organic fullerene production
- Automated laboratory equipment runs experiments in batches to create files like this:
Time: 1.2271
Concentration: 0.0050
Yield: 11.41
Time: 2.5094
Concentration: 0.0055
Yield: 11.20
Time: 3.7440
Concentration: 0.0060
Yield: 10.90
- Each experiment produces 20-30 files
- May eventually have several thousand of them
- Want to:
- Generate summary files showing the results for particular trials in tabular form
- Update a file showing the correlation between concentrations and yields based on those tabular summaries
- Comment on this slide
Hello, Make
- Suppose we already have a program called
dat2csv.py that reformats data as a table - Put the following into a file called
hello.mk:
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
- Note: that indentation must be a tab character, not eight spaces, or a mix of spaces and tabs
- Yeah, it's a wart, but we're stuck with it
- Some terminology:
![[Structure of a Make Rule]](img/make/make_structure.png)
Figure 14.1: Structure of a Make Rule |
hydroxyl_422.csv is the target of the rulehydroxyl_422.dat is its prerequisite
- We'll see in a moment that a target may have many prerequisites
- The compilation command is the rule's action
- A rule may also have many actions
- Run
make -f hello.mk
- Make sees that the CSV file depends on the data file
- Since the CSV file doesn't exist, Make executes the rule's action (i.e., runs
dat2csv.py)
- Run
make -f hello.mk againhydroxyl_422.csv is newer than its prerequisite, so the action is not executed
- Comment on this slide
Multiple Targets
- Very few programs consist of a single file
- This Makefile describes two files that are independent of one another
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
methyl_422.csv : methyl_422.dat
python dat2csv.py methyl_422.dat > methyl_422.csv
- When you run
make -f double.mk, only hydroxyl_422.csv is compiled- The first target in the Makefile is the default target
- Unless you tell it otherwise, that's all Make will update
- Run
make -f double.mk methyl_422.csv to build methyl_422.csv - Comment on this slide
Phony Targets
- Obviously don't want to have to run Make separately for each possible target
- That would hardly count as “automation”
- Solution: create a phony target that:
- Depends on all the things you want to recompile, but
- Doesn't correspond to any files, and so is never up to date
- Making it therefore always executes its actions
all : hydroxyl_422.csv methyl_422.csv
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
methyl_422.csv : methyl_422.dat
python dat2csv.py methyl_422.dat > methyl_422.csv
- Now,
make -f phony.mk all recreates both .csv files
- Typical phony targets in a software project:
"all": recompile everything"clean": delete all temporary files, and everything produced by compilation"install": copy files to system directories- Many open source projects are installed by typing
make, and then make install
- Comment on this slide
Automatic Variables
- Make defines automatic variables to represent parts of rules
- These variables are automatically redefined for each rule
- Names are unfortunately very cryptic
- Keep a reference card handy
"$@" | The rule's target |
"$<" | The rule's first prerequisite |
"$?" | All of the rule's out-of-date prerequisites |
"$^" | All prerequisites |
Table 14.1: Automatic Variables in Make
- Example
all : hydroxyl_422.csv methyl_422.csv
hydroxyl_422.csv : hydroxyl_422.dat
python dat2csv.py $< > $@
methyl_422.csv : methyl_422.dat
python dat2csv.py $< > $@
- Note: by default, Make echoes actions before executing them
- Putting
"@" at the start of the action line prevents this
- Comment on this slide
Pattern Rules
- Most files of similar type (Java, C++, etc.) are compiled the same way
- Rule #2 again: write a pattern rule specifying what to do in the general case
- Use the wildcard
"%" to represent the stem of the file's name in the target and prerequisites- Must use automatic variables in the actions
- This is why they were invented
all : hydroxyl_422.csv methyl_422.csv
%.csv : %.dat
python dat2csv.py $< > $@
clean :
@rm -f *.csv
- Notice that there's another phony target in this Makefile
make -f rule.mk clean will get rid of all the generated files- The
-f flag stands for “force”
- Tells
rm not to report an error when trying to delete files that don't exist
- Comment on this slide
Dependencies
- Next step: create an overall summary for each treatment
- E.g., combine
hydroxyl_422.csv and hydroxyl_480.csv to create hydroxyl_all.csv, and similarly for the methyl files, using summarize.py
hydroxyl_all must be rewritten if it is older than either hydroxyl_422.csv or hydroxyl_480.csv
- Which must in turn be rewritten if they're newer than their corresponding
.dat files
- In this case, also make the results depend on
summarize.py and dat2csv.py
- If the programs change, there's a good chance the data file format has changed
- Or a bug has been found and fixed
- Updated Makefile, and its output when no CSV files exist, are:
all : hydroxyl_all.csv methyl_all.csv
%_all.csv : %_422.csv %_480.csv summarize.py
python summarize.py $^ > $@
%.csv : %.dat dat2csv.py
python dat2csv.py $< > $@
clean :
@rm -f *.csv
python dat2csv.py hydroxyl_422.dat > hydroxyl_422.csv
python dat2csv.py hydroxyl_480.dat > hydroxyl_480.csv
python summarize.py hydroxyl_422.csv hydroxyl_480.csv > hydroxyl_all.csv
python dat2csv.py methyl_422.dat > methyl_422.csv
python dat2csv.py methyl_480.dat > methyl_480.csv
python summarize.py methyl_422.csv methyl_480.csv > methyl_all.csv
rm hydroxyl_480.csv methyl_422.csv hydroxyl_422.csv methyl_480.csv
- Notice that Make automatically removes intermediate files at the end
- Often visualize dependencies as a directed graph
- Dependencies can have dependencies, and so on
![[Visualizing Dependencies]](img/make/visualize_depend.png)
Figure 14.2: Visualizing Dependencies |
- Mark all the files that are out of date, and follow arrows to find out what will happen
- Note: Make can execute actions in any order it wants to, as long as it doesn't violate dependency ordering
- Notice that the rule
%_all.csv takes precedence over the rule %.csv
- Make uses the most specific rule available
- Comment on this slide
Defining Macros
- Often want to define values inside a Makefile
- E.g., the directory programs are to send their output to
- Or optimization flags for a compiler
- Do not want to set these options over and over again
- Rule #3: anything repeated in two or more places will eventually be wrong in at least one
- Solution: define variables (usually called macros)
- E.g., specify the directory where the lab machine puts data files:
INPUT_DIR = /lab/gamma2100
all : hydroxyl_all.csv methyl_all.csv
%_all.csv : %_422.csv %_480.csv
python summarize.py $^ > $@
%.csv : ${INPUT_DIR}/%.dat
python dat2csv.py $< > $@
clean :
@rm -f *.csv
- Define by specifying name and value
- To get value, put a
"$" in front of the name, and parentheses or braces around it- Without the parentheses, Make will interpret
"$XYZ" as the value of the variable "X", followed by the characters "YZ" - Yes, it's ugly
- Sometimes useful to pass values into Make when invoking it
- E.g., change the input directory
- Rather than rewriting the Makefile each time, you can specify
name=value pairs on the command line- Define a macro with the default value
- Override it when you want to
- So:
make -f macro.mk sets INPUT_DIR to /lab/gamma2100- But
make INPUT_DIR=/oldcrankymachine -f macro.mk uses /oldcrankymachine
- Make also looks at environment variables
- You can refer to
${HOME} in a Makefile without having defined it - Example:
VAL = original
echo :
@echo "VAL is" ${VAL}
$ make -f env.mk VAR=changed echo
- Comment on this slide
Analysis
- Pro
- Simple things are simple to do…
- …and not too difficult to read…
- …especially compared to the alternatives
- Con
- The syntax is unpleasant
- Complex things are difficult to read…
- …and even more difficult to debug
- Best you can do is use
echo to print things as Make executes
- Not really very portable
- Hands commands to the shell for execution
- But commands use different flags on different operating systems
- Do you use
del or rm to delete files?
- Alternatives
- Ant
- primarily for Java
- Less platform-dependent…
- …but just as hard to read and debug
- Integrated development environments
- Most are proprietary and platform-specific
- SCons
- Give users a library to manage dependencies and actions
- “Makefile” is actually a small program
- More powerful, and debuggable, but steeper learning curve
- Once builds are automated, the next step is to run them continuously
- Every time someone checks something into version control, rebuild the software (or site), and re-run tests
CruiseControl and Bitten will both handle the details
- Comment on this slide
Exercises
Exercise 14.1:
How can you stop Make from removing intermediate files
automatically when it finishes processing?
Exercise 14.2:
Make gets definitions from environment variables,
command-line parameters, and explicit definitions in Makefiles.
What order does it check these in?
Coding Style and Reading Code
Introduction
- Good programming style is very hard to teach
- All the “rules” are either banal or overly restrictive
- Saying, “Make methods short, but not too short” doesn't really help
- Neither does saying, “No method shall be longer than sixty lines”
- Matters aren't helped by the fact that the strength of programmers' opinions is inversely proportional to the amount of data they have
- But good style is tremendously important
- Well-written code is less likely to contain errors
- And is much easier to maintain
- Rather than teaching you how to write well, this lecture will focus on how to read code
- How to figure out how the code works
- How to find what you need to know before making a change
- Hope that you'll pick up ideas about good style along the way
- For a comprehensive guide to how open source Unix applications are (or should be) structured, try
[Spinellis 2003]
- Comment on this slide
Why Read Code?
- People doing creative work in almost every field routinely inspect, dissect, and critique what's come before
- Would you hire an architect who'd never looked at anyone else's buildings?
- Programming is a sad exception
- Most students only ever read one- or two-page fragments in textbooks
- Like reading sonnets, then being asked to write a novel
- Open source has made millions of lines of code available for inspection
- Some good, some bad, most average
- Knowing how to read code is as useful as knowing how to read a proof
- Learn by example how to accomplish specific things
- Figure out how to make changes in a particular program
- Figure out how to use the program
- Code inspections are essential for achieving and maintaining quality
- Comment on this slide
Seven Plus or Minus
- Known since the 1950s that human short term memory can hold 7±2 items
[Hock 2004]
- Seven random digits (as in phone numbers)
- Seven tasks that still have to be done
- Note: this is just an average
- Some prodigies can remember thousands of pieces of information for short periods of time
- If we try to remember more than that, we either:
- Make mistakes, or
- Create chunks so that we can remember things at a higher level
- Musicians don't remember individual chords; they remember IV-V-I progression
- Chess players remember “castled kingside”, rather than the positions of five pieces
![[Chunking in Short-Term Memory]](img/readstyle/castling.png)
Figure 15.1: Chunking in Short-Term Memory |
-
[Chase & Simon 1973]
studied what happened when novice and master chess players are shown actual and random positions
- Masters (naturally) remember actual positions better
- But they remember random positions worse
- They're “seeing” patterns that aren't there
![[Actual Chess Position]](img/readstyle/chess_actual.png)
Figure 15.2: Actual Chess Position |
| ![[Retention of Actual Chess Position]](img/readstyle/chess_actual_graph.png)
Figure 15.3: Retention of Actual Chess Position |
|
![[Random Chess Position]](img/readstyle/chess_random.png)
Figure 15.4: Random Chess Position |
| ![[Retention of Random Chess Position]](img/readstyle/chess_random_graph.png)
Figure 15.5: Retention of Random Chess Position |
|
- What does this have to do with programming?
- When you are reading and writing code, you have to keep a bunch of facts straight in your head for a short period of time
- What is this function supposed to return?
- What parameters does it have, and what do they mean?
- What does this loop's index refer to?
- Etc.
- The more odds and ends you require readers (including yourself) to keep track of, the greater their error rates will be
- Fundamental rule of good programming style is therefore to limit the amount of information needed at any one time
- Note that a hundred one-line methods is not an improvement over one hundred-line method
- A related rule: the greater the difference, the more likely we are to notice it
- So every difference in naming and layout ought to mean something…
- …because readers' brains will assume that it does
- Conversely, every significant difference ought to be visually distinct
- Comment on this slide
What Does This Function Do?
- Start with a simple question: how to figure out what this code does?
# Fill in citations.
def bibRef(root, bibOutput, bibInfo):
cites = root.getElementsByTagName('cite')
for cite in cites:
ref = str(cite.getAttribute('ref'))
if (not ref) or (ref not in bibInfo):
print "Cannot satisfy '%s'" % ref
assert False
cite.setAttribute('display', bibInfo[ref])
cite.setAttribute('href', '%s#%s' % (bibOutput, ref))
- First question: what's the context?
- In this case, it's part of a program to fill in citations
<cite ref="castro-xml"/> becomes <cite display="Castro 2002" href="bib.html#castro-xml" ref="castro-xml"/>
- Ways to answer the question:
- Guess based on function name
- Assume
bibRef stands for “bibliographic reference” - But it isn't a particularly informative name
- Read embedded documentation (such as comments)
- “Fill in citations.” is more helpful than
bibRef - Still doesn't tell us what
root, bibOutput, and bibInfo are - Or what side effects (if any) this function has
- Read the function body
- For each citation element in
body… - …get the value of the element's
ref attribute… - …make sure it's in
bibInfo (presumably a dictionary)… - …set the citation's
display and href attributes - Hm…that means
bibOutput must be the name of the output file - Notice how the explanation creates larger chunks from raws lines of code
- Consult external documentation
- In this case, there doesn't appear to be any
- Examine call(s) to the function
- First job is to find it (or them)
grep bibRef expandrefcite.py finds exactly one call # Substitute information.
for filename in args:
filename, _, _, root = getFile(filename)
genTime(root)
crossRef(root, xrefInfo)
bibRef(root, bibOutput, bibInfo)
outfile = open(filename, 'w')
outfile.write(root.toxml('utf-8'))
outfile.close()
- Looking higher up, see that
bibOutput is initialized to bib.html, so our guess was pretty good
- You use these techniques (consciously or otherwise) whenever you are reading code
- The purpose of good style is to improve the odds of them working
- A good way to think about size and scope is the notion of a program slice
- The subset of names in scope at a particular statement that are needed to understand what that statement does
- If the slice is much smaller than the method itself, the method is probably bloated
- Comment on this slide
Naming
- Names of files, classes, methods, variables, and other things are the most visible clue to purpose
- A variable called
temperature probably doesn't store a user name or the number of pottery shards found at a dig site
- Should therefore choose names that are both meaningful and readable
current_surface_temperature_of_probe is meaningful, but not readablecstp is easier to read, but hard to understand
- The smaller the scope of a name, the more compact it can be
- It's OK to use
i and j for indices in tightly-nested for loops - But not OK if the loop bodies are several pages long
- Of course, they shouldn't be anyway…
- The wider the scope of a name, the more descriptive it has to be
- Call a class
ExperimentalRecord, rather than ER or ExpRec
- Compare:
| Before | After |
|---|
import reader,splitter,transpose
import sys,os
a=[]
b=[]
c=[]
d=sys.argv[1]
a=reader.readLinesFromFile(d)
b=splitter.splitIntoSections(a)
c=d.split('.')
for i in range(len(b)):
if os.path.isfile('%s.%d.dat'%(c[0],i+1)):
print '%s.%d.dat already exists and cannot be overwritten!'%(c[0],i+1)
break
else:
output=file('%s.%d.dat'%(c[0],i+1),'w')
print>>output,transpose.transpose(b[i])
output.close()
| import sys, os
import reader, splitter, transpose
inputFileName = sys.argv[1]
lines = reader.readLinesFromFile(inputFileName)
sections = splitter.splitIntoSections(lines)
fileNameStem = inputFileName.split('.')[0]
for i in range(len(sections)):
outputFileName = '%s.%d.dat' % (fileNameStem, i+1)
if os.path.isfile(outputFileName):
print '%s already exists and cannot be overwritten!' % outputFileName
break
else:
output = file(outputFileName, 'w')
print >> output, transpose.transpose(sections[i])
output.close()
|
- Use naming conventions to indicate purpose
- Constants written
ALL_UPPER_CASE_WITH_UNDERSCORES - Class names written using
CamelCase with first letter capitalized- Name “camel case” comes from the humps in the middle
- Parameters, methods, local variables, etc., use
camelCase with first letter in lower case
- Many other conventions make just as much sense
- Write parameters, etc., using underscores, as in
current_maximum - Write class members like
this_, _this, or m_this
- Not as useful in Python, where you have to refer to members as
self.this anyway - But useful in C++, Java, and other languages
- Use an agreed list of abbrevations
currAveTemp instead of currentAverageTemperature- “Agreed” is important: mixing
currAvTemp and currAveTemp produces hard-to-spot bugs
- Most important thing is to be consistent
- Anything consistent is readable after a while
- Just watch kids learning to read French, Punjabi, and Korean
- But your brain automatically notices visual differences, and assumes they're important
- It actually does take longer to read code written in a mix of styles…
- …and the error rate is higher
- Conform to
PEP-008: Python Style Guide unless you have a really good reason not to - Comment on this slide
Idioms
- Every language (human or computer) has idioms
- Sayings or phrases whose meaning does not follow from the meaning of the words it is made up of
- E.g., “face the music” does not mean “look at the orchestra”
- A way to aggregate raw data into chunks
- As you read programs, look for uses of common idioms
- Example: C/C++ and Java allow assignment in the middle of an expression
z = (y = x) + 2 assigns the value of x to y, then adds 2 to that and assigns it to z
- No one would (or should) ever use it in the manner show above, but it makes certain kinds of loops more compact
| Without Idiom | Using Idiom |
|---|
line = infile.readLine();
while (line != null) {
...do something with line...
line = infile.readLine();
}
| while ((line = infile.readLine()) != null) {
...do something with line...
}
|
- Assign next line (if any) to
text, then check for end of input, all in the top of the loop
- This can be even more compact in C and C++, which treat null as equivalent to false
while (line = infile.readLine()) {
...do something with line...
}
- But Python doesn't allow assignment in the middle of expressions
- And in older versions, you couldn't iterate directly over a file using
for line in input: - Solution was an idiom that looked like an infinite loop, but had a
break statement in it to handle end-of-input while 1:
line = infile.readline()
if not line:
break
...do something with line...
- Note: using 1 instead of
True, since older Pythons didn't define True either
- You can figure out what the code is doing even if you don't know the idiom…
- …but knowing the idiom makes it a lot simpler for your brain to chunk what it's reading
- Style guideline: use your language's idioms
- If everyone else writes their code a certain way, write yours that way too
- I never liked Python's
while 1: break idiom, but I used it
- Two ways to learn idioms:
- Books on programming style (e.g., the “Effective” series from Addison-Wesley)
- Reading other people's code
- Note: idioms evolve over time
- Languages gain new features, programmers learn new tricks, etc.
- People who have grown up with Java tend to write cleaner C than people who learned it as a first language
- So when reading code, prefer the style of newer code over that of older stuff
- Comment on this slide
Style Tools
- A growing number of tools exist to check and enforce style guidelines
- Look for things that the language spec and compilers allow, but which make code harder to read
PyLint
- Parses programs to create an annotated syntax tree
- A data structure that represents classes, loops, variables, etc.
![[Annotated Syntax Tree]](img/readstyle/ast.png)
Figure 15.6: Annotated Syntax Tree |
- Then searches the tree to find things that match problem patterns
- Each pattern implemented as a class that looks for certain things in the tree
- Which means that users can write new checkers if they want to
class Metal:
name = "aluminum"
def __init__(self):
self.name = arg
def meth(self, val):
self.v = val
pylint badcode.py
************* Module badcode
W: 8: Bad indentation. Found 6 spaces, expected 8
W: 0: Missing docstring
W: 0: Missing required attribute "__revision__"
W: 1:Metal: Missing docstring
E: 5:Metal.__init__: Undefined variable 'arg'
W: 7:Metal.meth: Missing docstring
R: 1:Metal: Not enough public methods (1/2)
W: 8:Metal.meth: Attribute 'v' defined outside __init__
PyChecker
- Imports the module (or modules)
- If the code doesn't run,
PyChecker can't analyze it
- Uses compiled byte codes as well as source when doing analysis
pychecker badcode.py
Warnings...
badcode.py:5: No global (arg) found
- Similar tools exist for [name of your language goes here]
- Most check for much more than just indentation and naming conventions
- Can also find:
- Dead code (i.e., variables and methods that are never used)
- Unhandled exceptions
- Duplicate code
- And much more
- Use them during development
- Don't check anything into version control unless it passes a style check
- Take their warnings very seriously when debugging
- If someone was sloppy enough to make style mistakes, they probably made more serious mistakes as well
- Use them when reading code
- Many tools can reformat code so that it obeys style guidelines
- Do not reformat code just because you can
- Makes it difficult to track and merge changes in version control
- But reformatting a copy to make it easier to read can save you headaches
- Comment on this slide
What About Documentation?
- In an ideal world, all code would be accurately documented
- And the Beatles would still be performing as a group…
- In this world, documentation is often missing
- Or worse, it may have been up to date three years ago, but no longer describes the software correctly
- Many kinds of documentation are useful
- Embedded in code to explain tricky bits
- “This is a while loop, instead of a for, because it may delete items from the list as it goes”
- Associated with class/method headers to explain what they're for
- Documentation for a library
- “This function returns the first pair of non-overlapping subsequences that match the input pattern, or null otherwise”
- Architectural descriptions
- The architecture of a program is what you draw on the whiteboard when explaining its internals to other people
- May show relationships between classes, major modules, data flow, etc.
- Often the most important documentation of all, since it's what gives other programmers a mental map of how everything fits together
- Requirements
- Description of what needs the software is supposed to meet
- “If more than 100 events arrive in a second, the seismograph interface must store them in a queue“ tells you to look for a seismograph interface, and a queue that feeds it data
- User guide
- As above, knowing what the software is supposed to do often helps figure out how it works
- It's increasingly fashionable to embed some documentation in code, then extract and format it
- Javadoc (for Java) translates specially-formatted comments into HTML
| Java | HTML |
|---|
/**
* Returns the least common ancestor of two species based on DNA
* comparison, with certainty no less than the specified threshold.
* Note that getConcestor(X, X, t) returns X for any threshold.
*
* @param left one of the base species for the search
* @param right the other base species for the search
* @param threshold the degree of certainty required
* @return the common ancestor, or null if none is found
* @see Species
*/
public Species getConcestor(Species left, Species right, float threshold) {
...implementation...
}
| <p><strong>getConcestor</strong></p>
<p><code>public Species getConcestor(Species left, Species right, float threshold)</code></p>
<blockquote>
<p>Returns the least common ancestor of two species based on DNA
comparison, with certainty no less than the specified threshold. Note
that getConcestor(X, X, t) returns X for any threshold.</p>
<p><strong>Parameters:</strong></p>
<blockquote>
<p><code>left</code> - one of the base species for the search
<p><code>right</code> - the other base species for the search
<p><code>threshold</code> - the degree of certainty required
</blockquote>
<p><strong>Parameters:</strong></p>
<blockquote>
<p>the common ancestor, or null if none is found</p>
</blockquote>
<p><strong>See Also:</strong></p>
<blockquote>
<p><code>Image</code></p>
</blockquote>
</blockquote>
|
- Python's
PyDoc module does similar things
- Embedded documentation is more likely to be up to date than external documentation
- It's right in front of programmers as they make changes
- Although that's no guarantee…
- Comment on this slide
Traceability
- Reproducibility is one of the things that makes science science
- Anyone can (at least in principle) repeat your experiment to verify your results
- Traceability is a key part of this in computational science
- What program(s) did you use?
- Where did your data come from, and what did you do to it?
- Always include version control information in source files
- Subversion can be told to look for strings like
$Revision: 421$ when you submit changes, and replace the text after the colon with the appropriate information - Allows the file to carry its identity with it when it's printed, archived, or emailed
- Usually put the keyword inside a string, so that version information is available inside the program
__version__ = "$Revision: 423 $"
if __name__ == '__main__':
print __version__
- Data files are source files too!
- Every original data file should go under version control, and have revision information stored in it
- Can record author and date of last change as well as version number
- Carry this information through when processing files
- E.g., input file
biomes.dat has a header $Revision: 421$ - Is processed by version 37 of
ecoanalyzer.py - So include the following header (as comments) in the output file:
# $Revision: $
# From: biomes.dat version 421
# By: ecoanalyzer.py version 37
# On: 2005-09-22 12:14:07 EST
- Comment on this slide
Executable Documentation
- Also increasingly fashionable to talk about executable documentation
- Things that describe the program in ways that the computer can check
- Assertions
- Assertions at the top of a method are usually checking pre-conditions that must be true for the method to run properly
- Those at the bottom are guarantees of what will be true after the method has run
- Unless there are
return statements embedded in the middle of the method… - …which is another reason they're a bad idea
- Assertions in the middle of methods are usually checking:
- Invariants (i.e., number of items in the
searched list, plus the number in the toSearch list, stays constant) - Old bugs: common to introduce assertions while debugging, then leave them in to detect the bug's reappearance
- When reading code, assertions let you compare your current understanding of the code with what it actually does
- If you see
assert x < y, and you don't know why x has to be less than y at that point, you've lost track
- So when writing code, add assertions:
- To detect errors and mis-use early
- To help the next person understand how the code is supposed to behave
- Tests
- Each test is really just a super-assertion
- Each test that's up to date and runs correctly, that is
- They are not a substitute for documentation
- Which is easier to read, 23 test cases, or one sentence explaining the relationship between a method's three Boolean input parameters?
- But they are a definitive answer in cases of ambiguity
- When writing, use test-driven development to ensure that tests exist, and are kept up to date as code changes
- When reading, look for tests that use the classes or functions you're interested in, and make sure you understand why the given input produces the checked-for output
- Build
- Every medium to large project has infrastructure to compile source files, create installers, run tests, etc.
- Knowing how the software is built can tell you a lot about what to look for
- Entire directories of source code may be ignored depending on the operating system you're on
- Different modules may be used depending on whether it's the Lite or Professional version
- Build control files are notoriously difficult to read
- Often easier to perform a build, and look at its output to see what files are involved, what flags are used, etc.
- When writing code, try to keep the build as simple as possible
- If any parts of the software are only conditionally included, make the conditions very prominent
- If you're linking non-standard libraries, make sure they're flagged too
- Comment on this slide
Active Reading
- Sometimes, the only way to understand a piece of code is to get your hands dirty
- Logging
- Running in a debugger
- Dynamic code inspection tools
- Most languages have logging libraries that let programmers control debugging output
- Typical call is
log("GRID", LOG_DEBUG, "Refining a %d X %d grid", dimX, dimY) - First argument specifies topic (sometimes called “channel”)
- Allows readers to filter messages (e.g., select only those related to grid updates)
- Second specifies level: only print message if user has asked for that much detail
| Level | Meaning |
|---|
| DEBUG | Debugging output (of interest to programmers only) |
| INFO | Normal operations (e.g., a user logging in) |
| WARN | Something that requires an administrator's attention (e.g., a user is out of disk space) |
| ERROR | Something wrong with the program that it can recover from (e.g., unable to open its own configuration file) |
| CRITICAL | System about to crash, missiles about to launch, yet another bad 1970s TV show being turned into a movie |
Table 15.1: Typical Logging Levels
- At start of program, specify where to send logging output
- To screen
- To a file
- To a file that “rolls over” when it reaches a certain size
- To a database
- Through a socket to a collector on some other machine
- Etc.
- When writing code, add logging statements to record important events
- Set to “no output” by default to avoid annoying users
- When program starts to behave strangely, have users turn on logging and send you the trace
- Encourages cleaner coding
- If there's no point at which you can print, “Rotation complete”, it'll be hard for readers to figure out which code is responsible for doing rotations
- Debuggers aren't just for finding bugs
- Stepping through code is often the easiest way to figure out how it works
- First pass: step over functions, rather than into them
- Then re-run, and dig down to the next level of detail
- Hardest part is often figuring out where to start
- Depends on the architecture
- Command-line applications: start with
main (or, in Python, __name__ == '__main__') - If the code plugs into a larger framework, set breakpoints on key methods, then examine the call stack to see how you got there
- Some debuggers are more graphical than others
- Data Display Debugger (DDD) can draw box-and-arrow diagrams of data structures
![[Data Display Debugger]](img/readstyle/ddd.jpg)
Figure 15.7: Data Display Debugger
(Courtesy of Andreas Zeller, The DDD Manual)
|
- Integrated Development Environments
- A good IDE is more than just a fancy editor and a symbolic debugger
- Good ones (like Eclipse) also include code browsers
- Keep track of where classes, methods, and variables are defined
- Single click to jump from a variable to its definition, then to the definition of its class
- Another click to search for variables of a certain class (or its subclasses)
- And much, much more
- Similar tools available for old-fashioned text editors
- Ctags parses source files and produces a database of the definitions they contain
- Single editor command jumps from use to definition
- Currently works for over 30 languages
- Comment on this slide
Summary
- The most efficient way to read code is almost always to:
- Print it out
- Paper has 4-10 times the resolution of even the best screen
- Sit somewhere comfortable
- Away from email, the web, your pager, and other distractions
- Find an entry point (like
main) - Trace execution from there, skipping over stuff that looks straightforward
- E.g., argument parsing and file I/O
- Put a question mark beside everything that doesn't immediately make sense
- Once you're done, go back and try to answer your questions
- This also happens to be a good way to do a code review
- Read a colleague's code in order to:
- Find bugs
- Make sure she is following style guidelines
- Make sure she implemented what you asked for
- Code reviews are a more effective way to find bugs than running ad hoc tests
- Which is yet another good reason to learn how to read code
- Comment on this slide
Watching Programs Run
Turing's Great Insight
- Alan Turing's great insight was that programs are just another kind of data
- Source code is text that can be parsed line by line, using regular expressions, etc.
- Compiled programs are just bytes in memory
- Reflection is the ability of software to examine itself as it runs
- What methods does this class contain?
- What lines of code have been executed?
- How much time is being spent executing each line of code?
- It may sound very abstract, but:
- Every modern programming language provides it (to some extent)
- Many important tools rely on it
- A growing number of applications use it to make themselves more modular and easier to extend
- Will briefly describe features of other languages in order to explain things
- Trust me, it's not as scary as it first appears
- Comment on this slide
Faking Objects
- If Python didn't have objects and classes, you could fake them using its other features
- Seeing how will give you a clearer picture of how Python's own objects work
- Use a dictionary to store an object's members
- Including its methods…
- …which are just functions that take the object as their first argument
- So for each class, write a function that creates such a dictionary
def makeRect(x0, y0, x1, y1):
'''Construct a rectangle 'object'.'''
return {'x0' : x0,
'y0' : y0,
'x1' : x1,
'y1' : y1,
'area' : rectArea,
'contains' : rectContains,
'str' : rectStr
}
- Now write the “methods”
- Remember, each one will be passed the object as its first argument
def rectArea(obj):
'''Calculate area of rectangle.'''
return (obj['x1'] - obj['x0']) * (obj['y1'] - obj['y0'])
def rectContains(obj, x, y):
'''See if a rectangle contains an XY point.'''
return (obj['x0'] <= x <= obj['x1']) and \
(obj['y0'] <= y <= obj['y1'])
def rectStr(obj):
'''Show as string.'''
return '[%(x0)d,%(y0)d X %(x1)d,%(y1)d]' % obj
- Write a function to run an arbitrary method
- Look for it in the object's dictionary
- Pass it the object, and any other parameters given by the user
- Hand the result back to the original caller
def run(obj, meth, *args, **kwargs):
'''Run a method for an 'object', passing in arbitrary arguments.'''
func = obj[meth]
return func(obj, *args, **kwargs)
- Note: this function works for all “classes”
- Finally, write some tests to show how it works
# Create a rectangle.
block = makeRect(0, 0, 2, 2)
print run(block, 'str')
# Calculate its area.
print 'area is', run(block, 'area')
# See if it contains a couple of points.
print 'contains (1, 1)?', run(block, 'contains', 1, 1)
print 'contains (10, 10)?', run(block, 'contains', 10, 10)
- This is actually not very far from what Python actually does
- Replace
run(obj, 'method', …) with obj.method(…) - Let programmers define classes to group related functions together
- I.e., to make it clear that they're methods
- Classes also help implement inheritance
- Instances of a child class get all the methods of their parent class(es)
- If the child class redefines the method, the object gets a reference to the new one, rather than the old one
- Comment on this slide
How Other Languages Do It
- Java's object model is more restrictive than Python's
- All methods and members must be defined when code is compiled
- Can't be modified at runtime
- Less flexible, but makes automatic optimization easier
- Object stores its own data members, and a reference to the class it belongs to
- At runtime, the class is just another object
- Its members are its methods
obj.meth(args) is obj.class[methodname](args)
- I.e., look up the class, look in it to find the method, and call that
- C++, Fortran-90, and other languages are all variations on the theme
- Key idea: it's all functions underneath
- Object orientation is “just” a way to figure out which one to call
- Comment on this slide
Runtime Tricks
- So, what's all this good for?
- Well, how do you think a debugger works?
- Doesn't know anything about your code in particular
- So it uses reflection to look things up as it needs them
- Still useful even if you don't intend to write a debugger
hasattr(obj, attrName) reports whether an object has an attribute with a particular namegetattr(obj, attrName) gets the value of an object's attributesetattr(obj, attrName, newValue) set an attribute's value- Yes, you can add attributes to particular objects after they have been created
dir(obj) lists all the attribute of an object- All of these are very useful when debugging interactively
- Python's
import statement is evaluated at runtime as well- Unlike C++'s
#include or Java's import, which are evaluated by the compiler
- Can therefore decide at runtime which version of a class to load
- If running a production system, load the real one
- If testing, load a fake version
import sys
if (len(sys.argv) > 1) and (sys.argv[1] == '-d'):
import testLib as lib
else:
import realLib as lib
print lib.ackermann(3, 2)
- Take this one step further, and use information in a file to control what modules Python loads and calls
- Key idea: everything you load dynamically must obey some kind of protocol
- A set of rules that allow two entities to communicate without knowing anything else about each other
- “Everything reads text from standard input, and writes text to standard output” is an example
- You have to know something in order to use code…
- Example: want to apply a series of filters to a set of strings
- Just like Unix command line tools
- Protocol: every module must define a function called
func that takes a list of strings as input, and produces another list of strings as output def func(lines):
result = []
for i in range(1, len(lines), 2):
result.append(lines[i])
return result
- Load modules dynamically using the
__import__ function- Does what the
import statement does, but takes a string as an argument - Note: must not include the trailing
.py
- The whole thing is surprisingly small
import sys
filters = []
for moduleName in sys.argv[1:]:
moduleName = moduleName[:-3] # strip off '.py'
module = __import__(moduleName)
func = getattr(module, 'func')
filters.append(func)
print filters
lines = sys.stdin.readlines()
for f in filters:
lines = f(lines)
for l in lines:
print l,
- This is how systems like the Apache web server work
- Modules being loaded are written in C, rather than Python
- Which modules to load is specified in a complex configuration file, rather than on the command line
- But the principles are the same
- Makes it very easy to add new components, swap real components for debugging versions, etc.
- Can equally well use it to control what mesh and solver classes to load, or which sequence comparison tests to import
- Comment on this slide
Coverage
- As well as looking at code as a bunch of (static) bytes, we can watch what happens as it executes
- Very useful for debugging
- Even more useful for performance tuning
- When Python translates your program into bytecode, it keeps track of where instructions came from
- It can then make a record of which lines of code have been executed
- Typically accumulate data over several runs of the program
- Allows you to see which bits of your program are no longer being used
- More importantly, which bits aren't being tested
- The
coverage tool does four things:
coverage -x yourprogram.py …args… runs your program to collect coverage data- Data is accumulated in a file called
.coverage so that you can collect information over several runs of your program
- Example:
import sys
def foo(x, y):
for i in x:
y += 1
print 'result:', y
def bar(z):
if z < 0:
print "shouldn't happen"
elif z == 0:
print "shouldn't happen either"
else:
foo('abc', z)
if __name__ == '__main__':
for i in range(1, int(sys.argv[1])):
bar(i)
python coverage.py -x c.py 3
result: 4
result: 5
python coverage.py -r -m c.py
Name Stmts Exec Cover Missing
-------------------------------------
c 14 12 85% 10, 12
python coverage.py -a
> import sys
> def foo(x, y):
> for i in x:
> y += 1
> print 'result:', y
> def bar(z):
> if z < 0:
! print "shouldn't happen"
> elif z == 0:
! print "shouldn't happen either"
> else:
> foo('abc', z)
> if __name__ == '__main__':
> for i in range(1, int(sys.argv[1])):
> bar(i)
- Coverage is most useful when integrated with unit tests
- Usually not economical to aim for 100% code coverage
- Tests for simple get/set methods are as likely to be buggy as the methods themselves…
- …and some things (like the network cable being kicked out at exactly the wrong moment) really are hard to automate
- But very useful to know whether code coverage is going up or down
- If someone checks in a large change set, and code coverage immediately drops from 80% to 70%, you probably want to have a close look at those changes
- Comment on this slide
Profiling
- Rather than just flagging lines as they are executed, we could count how often each line executes
- The more often something runs, the more important it is to figure out how to make it fast
- If execution time is what we really care about, better to measure it directly
- A statement that takes 10 milliseconds to execute is ten thousand times more expensive than one than takes a microsecond
- Profiling is the act of measuring aspects of a program's performance
- Typically measure execution time
- Can also measure disk usage, memory footprint, etc.
- Two approaches
- Instrument the code by inserting instructions to record the clock time
- Typically record time at start and end of each method call, and total up the differences
- Deterministic, but expensive
- Sample the code by interrupting it at regular intervals to see where it is
- Builds a statistical profile of where in the code we're most likely to be
- Non-deterministic, but impact can be reduced by reducing sampling frequency, and running the program for longer
- Both kinds of profiling distort the program's behavior
- The Heisenberg Principle: measuring something changes it
- Python's
hotshot module measures execution time- Create an instance of
hotshot.Profile with the name of a log file as an argument- Profile data accumulates in that file
- Use the profile object's
runcall method to profile a single run of a function- Which may of course call other functions
- To process the data, use
hotshot.stats
- Can load data from log files, sort it in various ways, print it, etc.
- Example: compare speed of 1000 handwritten substring function calls with built-in
import sys
def builtin(str, substr):
return str.find(substr)
def handwritten(str, substr):
for i in range(0, len(str)-len(substr)):
match = True
for j in range(len(substr)):
if substr[j] != str[i+j]:
match = False
break
if match:
return i
return -1
def tests(num):
Tests = [
['a', ''],
['a', 'abc'],
['abc', 'a'],
['pqrstuvwxyz', 'stuv'],
['pqrstuvwxyz', 'stuw']
]
for i in range(num):
for (str, substr) in Tests:
expected = builtin(str, substr)
assert expected == handwritten(str, substr)
if __name__ == '__main__':
import hotshot, hotshot.stats
Filename = 'substr.prof'
num = int(sys.argv[1])
prof = hotshot.Profile(Filename)
prof.runcall(tests, num)
prof.close()
stats = hotshot.stats.load(Filename)
stats.strip_dirs()
stats.sort_stats('time', 'calls')
stats.print_stats()
10001 function calls in 0.292 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
5000 0.165 0.000 0.165 0.000 substr.py:6(handwritten)
1 0.081 0.081 0.292 0.292 substr.py:17(tests)
5000 0.045 0.000 0.045 0.000 substr.py:3(builtin)
0 0.000 0.000 profile:0(profiler)
- Hm…built-in method is only two and a bit times faster?
- What does it cost for an empty function call?
def builtin(str, substr):
return 0
def handwritten(str, substr):
return 0
10001 function calls in 0.166 CPU seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.098 0.098 0.166 0.166 empty.py:11(tests)
5000 0.034 0.000 0.034 0.000 empty.py:7(handwritten)
5000 0.034 0.000 0.034 0.000 empty.py:4(builtin)
0 0.000 0.000 profile:0(profiler)
- Take off overhead, and the difference is closer to 4X
- Still seems low, but at least now we have a way to investigate quantitatively
- Comment on this slide
Summary
- Major advances in programming come when languages start offering support for things the best programmers are doing anyway
- For loops and if/then/else formalized what good Fortran programmers were already doing
- Objects formalized the way good C, Pascal, and Lisp programmers managed their data structures and functions
- Java deserves credit for bringing two previously-esoteric practices into the mainstream
- Garbage collection: the computer recycles memory as it needs to
- Reflection: programs can inspect themselves at runtime
- Reflection simplifies the construction of large software systems
- Most big applications are now frameworks that load plug-in components dynamically
- A little extra effort…
- …but it forces programmers to really, truly modularize their code…
- …which also reduces maintenance and customization costs
- Watching programs run is an essential part of the software development process
- Which parts of my code can be thrown away?
- Which parts am I actually testing?
- Which parts of their work are my colleagues actually testing?
- Why is my program so slow?
- Modern computer systems are so complex that it's practically impossible to figure this out from first principles
- So write the code, profile, and then start tuning
- Comment on this slide
Exercises
Exercise 16.1:
What percentage of your code is tests? Is tested?
Exercise 16.2:
Can you honestly say that you write tests before code?
Find out how many tests currently pass or fail with a single
command? Identify the tests associated with a bug? Tell if your
code meets the team's standards?
Exercise 16.3:
Can you find out which functions use the most CPU time?
How long threads spend blocked on I/O? Who allocates memory, where,
for what? How accurate these numbers are? How today's profile
differs from last month's? How the profile differs across
machines?
Regular Expressions
Introduction
- How to count the blank lines in a file?
- Most people consider a line with just spaces and tabs to be blank
- But examining characters one by one is tedious
- More complex patterns (like telephone numbers or email addresses) are hard to get right
- Solution: use regular expressions (REs) instead
- Represent patterns as strings
- Just like the
"*" in the shell's *.txt
- Warning: the notation is ugly
- Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do
- Can't use superscripts and subscripts
- Comment on this slide
A Simple Example
- Load the
re module, then use re.search(pattern, text)
pattern is a regular expression that describes what you're looking fortext is the string you're searching in
- So what goes in
pattern?
- Letters, digits, and whitespace character match against themselves
- Punctuation characters have special meaning
| Pattern | Matches | Doesn't Match | Explanation |
|---|
⌈a*⌋ | "", "a", "aa", … | "A", "b" | ⌈*⌋ means “zero or more” matching is case sensitive |
⌈b+⌋ | "b", "bb", … | "" | ⌈+⌋ means “one or more” |
⌈ab?c⌋ | "ac", "abc" | "a", "abbc" | ⌈?⌋ means “optional” (zero or one) |
⌈[abc]⌋ | "a", "b", or "c" | "ab", "d" | ⌈[…]⌋ means “one character from a set” |
⌈[a-c]⌋ | "a", "b", or "c" | Character ranges can be abbreviated |
⌈[abc]*⌋ | "", "ac", "baabcab", … | Operators can be combined: zero or more choices from "a", "b", or "c" |
Table 17.1: Basic Regular Expression Operators
![[Matching]](img/re/option_match.png)
Figure 17.1: Matching |
re.search looks for a match anywhere in the text- Doesn't have to match the entire target string
import re
pattern = 'a[bc]*'
for text in ['b', 'ab', 'accb', 'mad']:
if re.search(pattern, text):
print '"%s" matches "%s"' % (pattern, text)
else:
print '"%s" does not match "%s"' % (pattern, text)
"a[bc]*" does not match "b"
"a[bc]*" matches "ab"
"a[bc]*" matches "accb"
"a[bc]*" matches "mad"
⌈a[bc]*⌋ matches an "a", followed by zero or more of either "b" or "c"
- Doesn't match
"b" because there's no leading "a" - Matches
"ab" and "accb"
- Why does it match
"mad"?
re.search looks for a match anywhere in text- Skips the
"m", then ⌈a⌋ matches "a", and ⌈[bc]*⌋ matches the empty string
- Comment on this slide
Anchoring
- If
re.search looks anywhere in the line, how to find blank lines?
- Don't consider
"x \n" or " x\n" blank
- Constrain what the RE can match using anchors
⌈^⌋ matches the beginning of the string⌈$⌋ matches the end- Neither consumes any characters
![[Anchoring Matches]](img/re/match_anchor.png)
Figure 17.2: Anchoring Matches |
- Examples
| Pattern | Text | Result |
|---|
⌈b+⌋ | "abbc" | Matches |
⌈^b+⌋ | "abbc" | Fails (string doesn't start with b) |
⌈c$⌋ | "abbc" | Matches (string ends with c) |
⌈^a*$⌋ | aabaa | Fails (something other than "a" between start and end of string) |
Table 17.2: Anchoring Regular Expressions
- Can now count blank lines in a file
import sys, re
# Nothing but space, tab, carriage return, newline from start to end
pattern = '^[ \t\r\n]*$'
# Count matches in one file/stream.
def count(filename, instream):
count = 0
for line in instream:
if re.search(pattern, line):
count += 1
print '%s %d' % (filename, count)
# Only standard instream?
if len(sys.argv) == 1:
count('<stdin>', sys.stdin)
else:
for filename in sys.argv[1:]:
instream = open(filename, 'r')
count(filename, instream)
instream.close()
- Note: always behave like a polite command line filter
- If no arguments given, read from standard input
- Comment on this slide
Escape Sequences
- How to match against a literal
"^" or "*"?
- Escape from the character's special meaning by putting
"\" in front of it ⌈\$⌋ matches a literal "$", and ⌈\\⌋ matches a literal "\"
- Must write these in Python as
"\\$" and "\\\\"
- Two layers of compilation:
- Python turn double backslashes into single backslash character
- Regular expression library then compiles single backslash plus something into special operation
![[Compiling Regular Expressions]](img/re/re_compile.png)
Figure 17.3: Compiling Regular Expressions |
- This can get very confusing, very quickly
"\t" is a tab character, which matches a tab character- But
"\\t" is the two-character sequence ⌈\t⌋, which also matches a tab character
-
"\" is also used in shorthand notation for common character sets| Sequence | Equivalent | Explanation |
|---|
⌈\d⌋ | ⌈[0-9]⌋ | Digits |
⌈\w⌋ | ⌈[a-zA-Z0-9_]⌋ | Word characters (i.e., those allowed in variable names) |
⌈\s⌋ | ⌈[ \t\r\n]⌋ | Whitespace |
Table 17.3: Regular Expression Escape Sequences
- A couple of useful special cases:
- The notation
⌈[^abc]⌋ means “anything except the characters in this set” ⌈.⌋ means “any character except the end of line”
⌈\b⌋ anchors the match to a break between word and non-word characters- Like
⌈^⌋ and ⌈$⌋, doesn't consume any actual characters ![[Word/Non-Word Breaks]](img/re/word_nonword_break.png)
Figure 17.4: Word/Non-Word Breaks |
- Comment on this slide
Extracting Matches
- Problem: to check for duplicate line numbers in an assembly language file
- Line numbers are the first thing on the line, terminated by a colon
load D 1
10 : sub A B
jlt A 20
- First try: search each line with a regular expression
- If it succeeds, extract the line number
import sys, re
# start of line, optional spaces, digits, more optional spaces, colon
numbered = '^\\s*\\d+\\s*:'
seen = {}
for line in sys.stdin:
if re.search(numbered, line):
num = line.split()[0]
if num in seen:
print num
else:
seen[num] = True
- But what if there is no space between the line number and the colon?
'2 :'.split() gives ['2', ':'], but '2:'.split()' gives ['2:']- Don't want to search the string to find the longest leading sequence of digits
- Need a better way to extract the text that matched sub-patterns
- Result of
re.search is actually a match object that records what what matched, and wheremo.group() returns the whole string that matched the REmo.start() and mo.end() are the indices of the match's locationimport re
text = 'abbcb'
for pattern in ['b+', 'bc*', 'b+c+']:
mo = re.search(pattern, text)
print '%s / %s => "%s" (%d, %d)' % \
(pattern, text, mo.group(), mo.start(), mo.end())
b+ / abbcb => "bb" (1, 3)
bc* / abbcb => "b" (1, 2)
b+c+ / abbcb => "bbc" (1, 4)
- Every parenthesized subexpression in the RE is a group
- Group 0 is the entire match
- Text that matched Nth parentheses (counting from left) is group N
mo.group(3) is the text that matched the third subexpression, m.start(3) is where it started
- Extracting line numbers is now easy:
import sys, re
# start of line, optional spaces, digits, more optional spaces, colon
numbered = '^\\s*(\\d+)\\s*:'
seen = {}
for line in sys.stdin:
mo = re.search(numbered, line)
if mo:
num = mo.group(1)
if num in seen:
print num
else:
seen[num] = True
- So is reversing two columns of numbers:
# optional spaces, number, required spaces, number, optional spaces
def reverse(instream, outstream):
cols = '^\\s*(\\d+)\\s+(\\d+)\s*$'
for line in instream:
mo = re.match(cols, line)
# If match, reverse numbers
if mo:
a, b = mo.group(1), mo.group(2)
print >> outstream, '%s\t%s' % (b, a)
# If no match, echo line (without adding extra newline at end)
else:
print >> outstream, line,
- Let's not forget to test it:
if __name__ == '__main__':
fixture = '''\
# Leading comment followed by blank line
10 20
30\t40\t
50
60 70 80
\t90 100
'''
expected = '''\
# Leading comment followed by blank line
20\t10
40\t30
50
60 70 80
100\t90
'''
from cStringIO import StringIO
instream = StringIO(fixture)
outstream = StringIO()
reverse(instream, outstream)
assert outstream.getvalue() == expected
- Comment on this slide
Compiling
- The regular expression library compiles patterns into a more concise form for matching
- Each regular expression becomes a finite state machine
- Library follows the arcs in the FSM as it reads characters
- Drawing FSMs is a good way to debug REs
![[Regular Expressions as Finite State Machines]](img/re/re_fsm.png)
Figure 17.5: Regular Expressions as Finite State Machines |
- You can improve a program's performance by compiling the RE once, and re-using the compiled form
- Use
re.compile(pattern) to get the compiled RE - Its methods have the same names and behavior as the functions in the
re module - E.g.,
matcher.search(text) searches text for matches to the RE that was compiled to create matcher
- Example: find all Title Case words in a document
def findAll(instream, outstream):
matcher = re.compile('\\b([A-Z][a-z]*)\\b(.*)')
for line in instream:
mo = matcher.search(line)
while mo:
print >> outstream, mo.group(1)
mo = matcher.search(mo.group(2))
- Notice how the function gets all matches:
- Pattern captures what we want in group 1, and everything else on the line in group 2
- Each time there's a match, continue the search in the remainder captured in group 2
- Tests are straightforward
if __name__ == '__main__':
fixture = '''\
This has several "Title Case" words
on Each Line (Some in parentheses).
'''
expected = '''\
This
Title
Case
Each
Line
Some
'''
from cStringIO import StringIO
instream = StringIO(fixture)
outstream = StringIO()
findAll(instream, outstream)
assert outstream.getvalue() == expected
print 'INPUT'
print fixture
print 'OUTPUT'
print expected
import re
#- start:findAll
def findAll(instream, outstream):
matcher = re.compile('\\b([A-Z][a-z]*)\\b(.*)')
for line in instream:
mo = matcher.search(line)
while mo:
print >> outstream, mo.group(1)
mo = matcher.search(mo.group(2))
#- end:findAll
# start:test
if __name__ == '__main__':
fixture = '''\
This has several "Title Case" words
on Each Line (Some in parentheses).
'''
expected = '''\
This
Title
Case
Each
Line
Some
'''
from cStringIO import StringIO
instream = StringIO(fixture)
outstream = StringIO()
findAll(instream, outstream)
assert outstream.getvalue() == expected
print 'INPUT'
print fixture
print 'OUTPUT'
print expected
# end:test
- Compiled REs have many other useful methods
- Including one that finds all matches, so you don't have to write the loop
- All are also available as top-level functions in the
re module | Method | Purpose | Example | Result |
|---|
split | Split a string on a pattern. | re.split('\\s*,\\s*', 'a, b ,c , d') | ['a', 'b', 'c', 'd'] |
findall | Find all matches for a pattern. | re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.') | ['Some', 'Title', 'Case'] |
sub | Replace matches with new text. | re.sub('\\d+', 'NUM', 'If 123 is 456') | "If NUM is NUM" |
Table 17.4: Regular Expression Object Methods
- Comment on this slide
Using REs in Other Languages
- Like sine and sorting, regular expressions are independent of language
- RE libraries exist for C/C++, Java, Perl, Ruby, MATLAB, …
- Syntax varies slightly, but the ideas are the same
- Example: Java's
java.util.regex package contains two classes:
Pattern: a compiled regular expressionMatcher: the result of a match- Typical usage:
public static String matchMiddle(String data) {
String result = null;
Pattern p = Pattern.compile("a(b|c)d");
Matcher m = p.matcher(data);
if (m.matches()) {
result = m.group(1);
}
return result;
}
- REs are actually built into languages like Perl and Ruby
- Just as dictionaries are built into Python, and arrays into MATLAB
- Typical Perl usage:
open MAIL, 'mail.txt'
while (<MAIL>) {
if (($name, $value) = /^([^:]+): ?(.+)$/) {
print "Message header $name is $value\n";
}
}
- Comment on this slide
But Wait, There's More
- We've only scratched the surface
- Regular expressions have proved to be too useful to remain clean and elegant
- Use
⌈|⌋ for either/or⌈ab|cd⌋ matches either "ab" or "cd"⌈a(b|c)d⌋ matches either "abd" or "acd"
- Use
⌈pat{N}⌋ to match exactly N occurrences of a pattern- More generally,
⌈pat{M,N}⌋ matches between M and N occurrences ⌈\d{2,3}⌋ matches "19" or "207", but not "3" or "4567"
- Well, OK, it will match the first three characters of
"456" - But
⌈^\d{2,3}⌋ won't
- Most important thing is to build up complex REs one step at a time
- Write something that matches part of what you're looking for
- Test it
- Add to it
- That's why they call it computer science: it's experimental
- For a broader tutorial, see
[Wilson 2005]
- Comment on this slide
Exercises
Exercise 17.1:
By default, regular expression matches are
greedy: the first term in the RE
matches as much as it can, then the second part, and so on. As a
result, if you apply the RE ⌈X(.*)X(.*)⌋ to the string
"XaX and XbX", the first group will contain "aX and Xb",
and the second group will be empty.
It's also possible to make REs match
reluctantly, i.e., to have the
parts match as little as possible, rather than as much. Find out
how to do this, and then modify the RE in the previous paragraph
so that the first group winds up containing "a", and the
second group " and XbX".
Basic XML and XHTML
Overview
- XML is quickly becoming the standard way to store data
- Web pages
- Spreadsheets
- Images
- Astronomical observations
- Bewildering variety of tools for dealing with it
- And more appearing every day
- Even simple things can be tricky to do right
- Generally agreed that standards are more complex than they should have been…
- …but no agreement on which bits should have been left out
- Examine:
- The rules for creating legal XML
- The XML-compliant dialect of HTML (for web pages)
- The standard Python library for manipulating XML
- Note: there are lots of others, each with its strengths
- Reading:
- Comment on this slide
History
- 1969-1986: Standard Generalized Markup Language (SGML)
- Developed by Charles Goldfarb and others at IBM
- A way of adding information to medical and legal documents so that computers could process them
<person role="litigant">
<given-name>Charles</given-name>
<surname>Babbage</surname>
</person>
- Allows computers to reliably find cases in which Charles Babbage sued someone
- Large (500-page spec) and complex
- 1989: Tim Berners-Lee creates HyperText Markup Language (HTML) for the World Wide Web
- Much (much) simpler than SGML
- Anyone could write it, so everyone did
- Problem: HTML had a small, fixed set of tags
- Everyone wanted to add new ones
- Solution: create a standard way to define a set of tags, and the relationships between them
- 1998: first version of XML is standardized
- A set of rules for defining markup languages
- Much more complex than HTML…
- …but still much simpler than SGML
- New version of HTML called XHTML was also defined
- Like HTML, but obeys all XML rules
- Still a lot of non-XML compliant HTML out there, though
- Comment on this slide
Formatting Rules
- For our purposes, an XML document contains elements and text
- Full spec allows for external entity references, processing instructions, and other fun
- Elements are shown using tags
- Must be enclosed in angle brackets
"<>" - Full form:
<tagname>…</tagname> - Short form (if the element doesn't contain anything):
<tagname/>
- Elements must be properly nested
- If Y starts inside X, Y must end before X ends
- So
<X>…<Y>…</Y></X> is legal… - …but
<X>…<Y>…</X></Y> is not
- Every document must have a single root element
- I.e., a single element must enclose everything else
- So the following is not a legal document
<first>
This document is <em>illegal</em>
</first>
<second>
because it does not have a unique root element.
</second>
- Text is normal printable text
- Must use escape sequences to represent
"<" and ">" - In XML, written
&name;
| Sequence | Character |
|---|
< | < |
> | > |
" | " |
& | & |
Table 18.1: Common XML Escape Sequences
- Specific dialects of XML may or may not restrict which elements can appear inside which others, and where text can appear
- XHTML is very liberal
- MathML (Mathematical Markup Language) is stricter
- Comment on this slide
XHTML
- Most common use of XML is still XHTML (the XML version of hypertext)
- Basic tags:
html | Root element of entire HTML document. |
|---|
body | Body of page (i.e., visible content). |
|---|
h1 | Top-level heading. Use h2, h3, etc. for second- and third-level headings. |
|---|
p | Paragraph. |
|---|
em | Emphasized text; browser or editor will usually display it in italics. |
|---|
address | Address of document author (also usually displayed in italics). |
|---|
Table 18.2: Common XHTML Tags
- Note: XHTML includes both semantics (“What does this mean?”) and display (“How should this be drawn?”)
h1 (level-1 heading) is semantic, i (italics) is display- Now generally considered a bad thing
- Documents should only contain semantics
- Display of that semantics should be specified separately…
- …so that different browsers (or devices) can do it differently
- Sample document:
<html>
<body>
<h1>Software Carpentry</h1>
<p>This course will introduce <em>essential software development skills</em>,
and show where and how they should be applied.</p>
<address>Greg Wilson (gvwilson@third-bit.com)</address>
</html>
|
![[Simple Page Rendered by Firefox]](img/xml/simple_page_firefox.png)
Figure 18.1: Simple Page Rendered by Firefox |
|
![[Simple Page Rendered by Internet Explorer]](img/xml/simple_page_ie.png)
Figure 18.2: Simple Page Rendered by Internet Explorer |
|
- Comment on this slide
Attributes
- Elements may also have attributes
- Each attribute is a name/value pair that provides extra information about the element
- Enclosed in the opening tag
<h1>A Centered Heading</h1><p>This planet provided as-is.</p>
- Each name may appear at most once
- Like keys in a dictionary
<p>…</p> is illegal
- Values must be quoted
- Old-style HTML often allowed things like
<p>…</p>, but modern parsers will reject it - Must use escape sequences for angle brackets, quotes, etc. inside values
- Strictly speaking, attributes are redundant
- Can always re-write XML using only elements
- Usually more typing…
- …but that only matters if you're creating the XML by hand…
- …and an increasing amount is created by machines, for machines
| With Attributes | Without Attributes |
|---|
<a b="c">
<d e="f"/>
</a>
|
<a>
<a-b>c</a-b>
<d><d-e>f</d-e></d>
</a>
|
- You should use attributes when:
- Each value can occur at most once for any element.
- The order of the values doesn't matter.
- Those values have no internal structure.
- If you have to do any significant work on an attribute's value to figure out what it means, you should use an element instead.
- Comment on this slide
More XHTML Tags
- Well-written HTML pages have a
head element as well as a body
- Contains metadata about the page
| Element | Example | Purpose |
|---|
title | <title/> | Page title (for display in title bar of browser, bookmarks, etc.) |
meta | <meta/> | Information about the document (typically, to help with search) |
Table 18.3: More XHTML Tags
- Well-written pages also use comments (just like code)
- Introduce with
<!--, and end with --> <html>
<head>
<title>Comments Page</title>
<meta name="author" content="aturing"/>
</head>
<body>
<!-- House style puts all titles in italics -->
<h1><em>Welcome to the Comments Page</em></h1>
<!-- Update this paragraph to describe the forum. -->
<p>Welcome to the Comments Forum.</p>
</body>
</html>
- Many other tags can be used (and abused) in HTML pages
- Use
ul for an unordered (bulleted) list, and ol for an ordered (numbered) one- Each list item is wrapped in
li
- Use
table for tables- Each row is wrapped in
tr (for “table row”) - Within each row, column items are wrapped in
td (for “table data”) - Note: tables are often used to force multi-column layout, as well as for tabular data
<html>
<head>
<title>Lists and Tables</title>
<meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>
<table cellpadding="3" border="1">
<tr>
<td align="center"><em>Unordered List</em></td>
<td align="center"><em>Ordered List</em></td>
</tr>
<tr>
<td align="left" valign="top">
<ul>
<li>Hydrogen</li>
<li>Lithium</li>
<li>Sodium</li>
<li>Potassium</li>
<li>Rubidium</li>
<li>Cesium</li>
<li>Francium</li>
</ul>
</td>
<td align="left" valign="top">
<ol>
<li>Helium</li>
<li>Neon</li>
<li>Argon</li>
<li>Krypton</li>
<li>Xenon</li>
<li>Radon</li>
</ol>
</td>
</tr>
</table>
</body>
</html>
| ![[Lists and Tables]](img/xml/list_table_firefox.png)
Figure 18.3: Lists and Tables |
|
- Note how Subversion keywords have been put in
meta elements in document head- Automatically updated each time the document is committed to version control
- Comment on this slide
Connecting to Other Data
- How to put an image in a page?
- XML documents can only contain text, so you can't store an image or audio clip directly in a page
- Unless you encode it as text
- Usual solution is to store a reference to the external file using the
img tag- The
src argument specifies where to find the image file
<html>
<head>
<title>Images</title>
<meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>
<h1>Our Logo</h1>
<img src="../../img/swc_logo.jpg" alt="[Software Carpentry Logo]"/>
</body>
</html>
| ![[Images in Pages]](img/xml/image.png)
Figure 18.4: Images in Pages |
|
- Important to always use the
alt attribute to specify alternative text- Screen readers for people with visual handicaps use this instead of the image
- And it's good documentation for search engines
- Often, the “other data” you want to connect to is other HTML pages
- This is what makes it hypertext…
- Use the
a element to create a link- The text inside the element is displayed and (usually) underlined for clicking
- The
href attribute specifies what the link is pointing at - Both local filenames and URLs are supported
<html>
<head>
<title>Links</title>
<meta name="svn" content="$Id: xml.swc 54 2005-04-13 13:29:28Z gvwilson $"/>
</head>
<body>
<h1>A Few of My Favorite Places</h1>
<ul>
<li><a href="http://www.google.com">Google</a></li>
<li><a href="http://www.python.org">Python</a></li>
<li><a href="http://www.nature.com/index.html">Nature Online</a></li>
<li>Examples in this lecture:
<ul>
<li><a href="comments.html">Comments</a></li>
<li><a href="image.html">Images</a></li>
<li><a href="list_table.html">Lists and Tables</a></li>
</ul>
</li>
</ul>
</body>
</html>
| ![[Links in Pages]](img/xml/links.png)
Figure 18.5: Links in Pages |
|
- Comment on this slide
Accessibility
- The web is not a particularly friendly place if you're visually disabled
- Screenreaders have a hard time dealing with web pages that use graphics instead of text for buttons…
Top 10 Accessible Web Authoring Practices describes what you should do to make your pages more accessible- All of these things help search engines and other automatic tools as well
- Comment on this slide
The Document Object Model
- The Document Object Model (DOM) is a cross-language standard for representing XML documents as trees
- Elements, attributes, and text all represented as objects
- Strengths:
- Much easier to manipulate trees than strings
- Same basic model in many different languages (which lowers the learning cost)
- Weaknesses:
- Needs a lot of memory for large documents
- Its generic model doesn't take advantage of the more advanced features of some languages
- Most popular alternative is SAX (the Simple API for XML)
- Turns an XML document into a stream of events
- “Element, element, text, element, text…
- Easy to do very simple things…
- …but anything complex requires the programmer to reimplement a subset of DOM
- Python comes with a simple implementation of DOM called
minidom
- Fast, sturdy, and well documented…
- …if you understand all the terminology, and know more or less what you're looking for)
- Comment on this slide
The Basics
- Every DOM tree has a single root representing the document as a whole
- Doesn't correspond to anything that's actually in the document
- This element has a single child, which is the root node of the document
- This node, and other element nodes, may have three types of children:
- Other elements
- Text nodes
- Attribute nodes
- Every node keeps track of what its parent is
- Allows programs to search up the tree, as well as down
- Example:
| XML | DOM Tree |
|---|
<root>
<first>element</first>
<second attr="value">element</second>
<third-element/>
</root> | ![[A DOM Tree]](img/xml/dom_tree.png)
Figure 18.6: A DOM Tree |
|
|---|
- Note: it's common to forget that text and attributes are stored in nodes of their own
- Some other Python implementations of DOM don't bother
- Make simple things simpler…
- …but only a little bit
- Comment on this slide
Creating a Tree
- Usual way to create a DOM tree is to parse a file
- If this is the file:
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
- Parse and print like this:
import xml.dom.minidom
doc = xml.dom.minidom.parse('mercury.xml')
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mercury">
<period units="days">87.97</period>
</planet>
- The
toxml method can be called on the document, or on any element node- Note that we specify
"utf-8" as the character encoding - DOM trees always store text as Unicode, so when you're converting the tree to text, you must tell the library how to represent characters
- Can also create a tree by parsing a string
- Works just like parsing a file
import xml.dom.minidom
src = '''<planet name="Venus">
<period units="days">224.7</period>
</planet>'''
doc = xml.dom.minidom.parseString(src)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Venus">
<period units="days">224.7</period>
</planet>
- Finally, can build a tree by hand
import xml.dom.minidom
impl = xml.dom.minidom.getDOMImplementation()
doc = impl.createDocument(None, 'planet', None)
root = doc.documentElement
root.setAttribute('name', 'Mars')
period = doc.createElement('period')
root.appendChild(period)
text = doc.createTextNode('686.98')
period.appendChild(text)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<planet name="Mars"><period>686.98</period></planet>
xml.dom.minidom is really just a wrapper around other platform-specific XML libraries- Have to reach inside it and get the underlying implementation object to create the
document node - That node then knows how to create other elements in the document
- Library explains what the first and third arguments to
createDocument are - Middle one tells
createDocument what type of element the document's root node should be
- Set attributes of element nodes using
setAttribute(attributeName, newValue)
- Remember, all attribute values are strings
- If you want to store an integer or a Boolean, you have to convert it yourself
- Add new nodes to existing ones by:
- Asking the document to create the node
- Appending it to a node that's already part of the tree
- Notice that the output of the preceding example wasn't nicely indented
- We didn't tell DOM to create text nodes containing carriage returns and blanks
- Most machine-generated XML doesn't
- Comment on this slide
Walking a Tree
- Often want to visit each node in the tree
- E.g., print an outline of the document showing element nesting
- Simplest way is to write a recursive function
import xml.dom.minidom
src = '''<solarsystem>
<planet name="Mercury"><period units="days">87.97</period></planet>
<planet name="Venus"><period units="days">224.7</period></planet>
<planet name="Earth"><period units="days">365.26</period></planet>
</solarsystem>
'''
def walkTree(currentNode, indent=0):
spaces = ' ' * indent
if currentNode.nodeType == currentNode.TEXT_NODE:
print spaces + 'TEXT' + ' (%d)' % len(currentNode.data)
else:
print spaces + currentNode.tagName
for child in currentNode.childNodes:
walkTree(child, indent+1)
doc = xml.dom.minidom.parseString(src)
walkTree(doc.documentElement)
solarsystem
TEXT (1)
planet
period
TEXT (5)
TEXT (1)
planet
period
TEXT (5)
TEXT (1)
planet
period
TEXT (6)
TEXT (1)
- Node's type is stored in a member variable called
nodeType
ELEMENT_NODE, TEXT_NODE, ATTRIBUTE_NODE, DOCUMENT_NODE
- If a node is an element, its children are stored in a list called
childNodes
- A read-only structure
- See how to add, delete, and move children in a moment
- If a node is a text node, the text is in the member
data
- The single-character text nodes are the carriage returns separating elements
- Traversing a tree like this is just one of many recurring patterns in object-oriented programming
- The Visitor pattern is used to separate traversal of a data structure from operations on its elements
- One class traverses a particular kind of structure the same way each time
- User then defines the operation
- Derive a class, pass a function as an argument, etc.
- The fact that this can be done in several different ways is what makes it a pattern
- Step 1: define generic behavior
class Visitor(object):
def __init__(self):
pass
def visit(self, node):
# When given the document, skip to the root.
if node.nodeType == node.DOCUMENT_NODE:
self.visit(node.documentElement)
return
# Handle other types of nodes.
self.before(node)
self.at(node)
if node.nodeType == node.ELEMENT_NODE:
for child in node.childNodes:
self.visit(child)
self.after(node)
def doNothing(self, node):
pass
before = doNothing
at = doNothing
after = doNothing
- Users call
Visitor.visit with the root node of the tree they want to traverse - Override one or more of the three do-nothing methods to perform actions before, at, or after nodes
- Step 2: derive a class that does something useful (like count how many nodes are in the tree)
class Counter(Visitor):
def __init__(self):
Visitor.__init__(self)
self.count = 0
def at(self, node):
if node.nodeType == node.ELEMENT_NODE:
self.count += 1
- Initialize
count to zero before traversing - Increment the count at each element node
- Step 3: test
if __name__ == '__main__':
src = '<a><b>c</b><d>e</d><f>g<h/>i</f></a>'
tree = xml.dom.minidom.parseString(src)
c = Counter()
c.visit(tree)
assert c.count == 5
- Comment on this slide
Modifying the Tree
- Modifying trees in place is a little bit tricky
- Helps to draw lots of pictures
- Example: want to emphasize the first word of each paragraph
- Get the text node below the paragraph
- Take off the first word
- Insert a new
<em/> element whose only child is a text node containing that word ![[Modifying the DOM Tree]](img/xml/modify_tree.png)
Figure 18.7: Modifying the DOM Tree |
- Ah, but it's not that simple
- What if the first child of the paragraph already has some markup around it?
- E.g., what if the paragraph starts with a link?
- Could just wrap the first child with
<em/>
- But if (for example) the link contains several words, this will look wrong
- We'll ignore this problem for now
- First part of solution: find all the paragraphs using
getElementsByTagName, and iterate over them- You'll use this method a lot…
def emphasize(doc):
paragraphs = doc.getElementsByTagName('p')
for para in paragraphs:
first = para.firstChild
if first.nodeType == first.TEXT_NODE:
emphasizeText(doc, para, first)
- Second part: break the paragraph text into pieces, and handle each piece in turn
- Create a new node for each piece
- Push it onto the front of the paragraph's child list
- Once they've all been handled, get rid of the original text node
def emphasizeText(doc, para, textNode):
# Look for optional spaces, a word, and the rest of the paragraph.
m = re.match(r'^(\s*)(\S*)\b(.*)$', str(textNode.data))
if not m:
return
leadingSpace, firstWord, restOfText = m.groups()
if not firstWord:
return
# If there's text after the first word, re-save it.
if restOfText:
restOfText = doc.createTextNode(restOfText)
para.insertBefore(restOfText, para.firstChild)
# Emphasize the first word.
emph = doc.createElement('em')
emph.appendChild(doc.createTextNode(firstWord))
para.insertBefore(emph, para.firstChild)
# If there's leading space, re-save it.
if leadingSpace:
leadingSpace = doc.createTextNode(leadingSpace)
para.insertBefore(leadingSpace, para.firstChild)
# Get rid of the original text.
para.removeChild(textNode)
- Third part: test it
- Yes, it really is part of the program
if __name__ == '__main__':
src = '''<html><body>
<p>First paragraph.</p>
<p>Second paragraph contains <em>emphasis</em>.</p>
<p>Third paragraph.</p>
</body></html>'''
doc = xml.dom.minidom.parseString(src)
emphasize(doc)
print doc.toxml('utf-8')
<?xml version="1.0" encoding="utf-8"?>
<html><body>
<p><em>First</em> paragraph.</p>
<p><em>Second</em> paragraph contains <em>emphasis</em>.</p>
<p><em>Third</em> paragraph.</p>
</body></html>
- Comment on this slide
Summary
- There's a lot of hype in hypertext
- Haven't yet heard anyone claim that XML will cure the common cold, but I'm sure it's been said
- Strengths:
- One set of rules for people to learn
- One parser can handle all of their data
- At least, the low-level syntactic bits—still need to figure out what all those tags mean
- Weaknesses:
- Raw XML is hard to read
- Particularly if it has been generated by a machine
- A lot of data isn't actually trees
- When storing a 2D matrix or a table, you have to organize data by row or by column…
- …either of which makes the other hard to access
- There are a lot of complications and subtleties
- Most applications ignore most of them
- Which means that they fail (usually badly) when confronted with something outside the subset they understand
- Like Inglish speling, it's here to stay
- Comment on this slide
A Mini-Project
Eating Your Own Cooking
- Been talking since the first lecture about the importance of tools, and about building your own to automate repetitive tasks
- This lecture takes a look at some of the tools used to manage the raw material for this course
- Show technologies like XML and regular expressions in action
- Show how to design something simple, and grow it over time
- Starting point: lecture slides are written in a simple XML format
- Root of each lecture is a
<lec/> element with title and id attributes - May contain one or more
<topic/> elements- Must have
title attribute - May optionally have a
summary attribute (used to construct the syllabus)
- Topics contain one or more
<slide/> elements- These contain
<b1/> (for “bullet level 1”), which contain <b2/>, and so on - May also contain tables, images, and code fragments
- Our task is to validate these files to make sure that:
- They contain only printable characters
- Tabs haven't been used for indentation
- The ID in the root
<lec/> element matches the filename - All the external files the lecture references (such as images and sample code) exist
- Solution: write a command-line utility that:
- Takes a list of filenames, along with
- Some command-line flags specifying which checks to run (or omit), and
- Reports any problems
- You probably won't ever need to do this for lecture slides…
- …but at some point, you probably will want to check the integrity of data files, experimental results, etc.
- Comment on this slide
Checking for Tabs
- Start with the simplest task: checking for tabs in files
- Open each file in turn
- Check each line for tabs
- Note: could also read entire contents of file into a string, and search for tabs in that
- If any found, print message
#!/usr/bin/env python
'''Check for tabs in one or more files.'''
import sys
def checkTabs(filename):
'''Look for tabs.'''
infile = open(filename, 'r')
for line in infile.readlines():
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
infile.close()
if __name__ == '__main__':
for filename in sys.argv[1:]:
checkTabs(filename)
- Great—except it only works on files
- Can't pipe things to it, because it doesn't know how to read standard input
- Implement the standard Unix convention: if no filenames provided, read from standard input
- Hm…
sys.stdin is an already-open file, not a filename - Change signature of
checkTabs to take both the filename, and an open file- Move the
open and close to the main body
#!/usr/bin/env python
'''Check for tabs in one or more files, or on standard input.'''
import sys
def checkTabs(filename, infile):
'''Look for tabs.'''
for line in infile.readlines():
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
if __name__ == '__main__':
if len(sys.argv) == 1:
checkTabs('<stdin>', sys.stdin)
else:
for filename in sys.argv[1:]:
infile = open(filename, 'r')
checkTabs(filename, infile)
infile.close()
- Great—except it doesn't report errors like missing or unreadable files
- Printing a stack trace doesn't count
- Fix by wrapping the code in an exception handler
- Only catch the kinds of exceptions we think are reasonable to expect
- Don't want the error handling to mask errors that we didn't anticipate
#!/usr/bin/env python
'''Check for tabs in one or more files, or on standard input, and
report errors.'''
import sys
def checkTabs(filename, infile):
'''Look for tabs.'''
for line in infile.readlines():
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
if __name__ == '__main__':
try:
if len(sys.argv) == 1:
checkTabs('<stdin>', sys.stdin)
else:
for filename in sys.argv[1:]:
infile = open(filename, 'r')
checkTabs(filename, infile)
infile.close()
except IOError, e:
print >> sys.stderr, e
- Note: could equally well put the exception handler:
- Inside the
else (since we don't think I/O errors can happen while reading standard input) - Inside the
for (so that if an error occurs while reading one file, the program continues on to the next)
- Comment on this slide
Running Tools
- Now, how to run the validation tool?
python check_tabs.py file1 fil2 file… will work…- …but typing in a bunch of filenames every time would be annoying
- Which means that we wouldn't do it as often as we should
- And we know we're going to have other validation tools to run as well
- Put the command in a Makefile
- If it's worth doing again, it's worth automating
- Directory structure of the course:
- A root directory
lec for lecture notes (in .swc files)util for utility programs (like the validation tools)img for images- Images for
lec/xyz.swc go in the img/xyz directory
src for source codesrc/xyz holds sample files for the XYZ lecture
- Makefile runs tab checker on all
.swc files# Re-make everything used in the Software Carpentry course.
all :
@echo 'options: clean validate'
clean :
@rm -f *~ */*~ */*/*~
validate :
@python util/check_tabs.py lec/*.swc
- Note: also added a
clean target that gets rid of editor backup files ending in ~ - And a default target called
all, that lists the things the Makefile can do - Remember, the
"@" in front of the commands means, “Don't echo the command before running it”
- Comment on this slide
Checking for Printable Characters
- Next on the list: make sure that files only contain printable characters
- I.e., whitespace, alphanumeric, and punctuation
- Some editors insert “smart” (curly) quotes, automatically convert
"---" into "—", etc.
- But other editors can't display these
- So disallow them, and require authors to use XML escape sequences for anything special
- Solution: make sure every character in the file is in
string.printable
- Contains letters, digits, spaces, and common punctuation
- Easier to use this than to write our own regular expression
- Should we add this function to the existing validation program, or create a second program?
- The former lets the checking functions share the code that opens files, handles errors, etc.
- The latter makes it easier to re-use the pieces separately
- Solution: put them in the same program, and provide command-line flags to disable certain tests
- Might be philosophically purer to have the flags turn tests on…
- …but the normal case is going to be run them all
- Parse command-line options using the
getopt module- First argument is the list of command-line arguments to parse
- Not including
sys.argv[0], which is the name of the program
- Second is a string telling it what flags to look for, and whether they take arguments
"a:bcd:" means “-a and -d have an argument, -b and -c don't”
doPrintable = True
doTabs = True
settings, filenames = getopt.getopt(sys.argv[1:], 'pt')
for (opt, arg) in settings:
if opt == '-p':
doPrintable = False
elif opt == '-t':
doTabs = False
- Have to make another change to
checkTabs
- It and
checkPrintable need to process the same data - So neither can read that data in from the file
- Unless we want to open and read the file twice
- Solution: separate input, processing, and output
- Main body reads data and calls validation functions
try:
if not filenames:
lines = sys.stdin.readlines()
checkTabs('<stdin>', lines)
checkPrintable('<stdin>', lines)
else:
for filename in filenames:
infile = open(filename, 'r')
lines = infile.readlines()
infile.close()
checkTabs(filename, lines)
checkPrintable(filename, lines)
except IOError, e:
print >> sys.stderr, e
- Change to
checkTabs is easy def checkTabs(filename, lines):
'''Look for tabs.'''
for line in lines:
if line.find('\t') >= 0:
print '%s contains tabs' % filename
break
- And
checkPrintable is simple as well def checkPrintable(filename, lines):
'''Look for non-printable characters.'''
for line in lines:
for c in line:
if c not in string.printable:
print '%s contains non-printable characters' % filename
print line
break
- Comment on this slide
Checking Glossary Entries
- Course has a glossary that defines new or unusual terms
- Entries in lectures are formatted as
<d ref="immutable">Immutable</d>
<d/> for “definition”ref attribute is the term that appears in the glossary- Contained text is displayed in-line
- Glossary is structured like this:
<glossary>
<glosssec title="A">
<glossitem id="absolute_path" term="absolute path">...definition...</glossitem>
<glossitem id="abstract_data_types" term="abstract data types">...</glossitem>
<glossitem id="access_control" term="access control">...</glossitem>
...
<glossitem id="automatic_variables" term="automatic variables (in Make)">...</glossitem>
</glosssec>
...
</glossary>
- Goal: make sure every term is defined once, and only once
- Read in the glossary
- Record the item IDs
- Read a set of files, marking items as they're seen
- If an item has already been marked, report the duplication
- At the end, look for items that haven't been marked off
- Hm…what happens if we're only checking one file?
- Want to be able to suppress the check for all items being marked off
- So need two flags:
- One to specify the name of the glossary file (so we don't try to read it as a lecture file)
- One to turn off the check for all items being marked off
glossary = None
doGlossaryComplete = True
doPrintable = True
doTabs = True
settings, filenames = getopt.getopt(sys.argv[1:], 'G:gpt')
for (opt, arg) in settings:
if opt == '-G':
glossary = arg
elif opt == '-g':
doGlossaryComplete = False
elif opt == '-p':
doPrintable = False
elif opt == '-t':
doTabs = False
- Note: if no glossary file specified, don't check glossary items at all
- Now have enough logic that it's worth reorganizing the main processing loop
- If no filenames specified, set the list of filenames to
['<stdin>'] - Write a function
readFile to open and read a file- If the function name is
"<stdin>", it reads from sys.stdin - Have function function return both a list of lines, and the XML DOM tree
- We need the first to check for tabs and printable characters, and the second to look for glossary items
- Modified processing code includes checks for what to do with the glossary
try:
if glossary:
glossary = readGlossaryFile(glossary)
if not filenames:
filenames = ['<stdin>']
for filename in filenames:
lines, doc = readFile(filename)
checkTabs(filename, lines)
checkPrintable(filename, lines)
if glossary:
checkGlossary(filename, doc, glossary)
if glossary and doGlossaryComplete:
checkGlossaryComplete(glossary)
except IOError, e:
print >> sys.stderr, e
except xml.parsers.expat.ExpatError, e:
print >> sys.stderr, e
- Then write
readFile
def readFile(filename):
if filename == '<stdin>':
data = sys.stdin.read()
else:
infile = open(filename, 'r')
data = infile.read()
infile.close()
infile = cStringIO.StringIO(data)
lines = infile.readlines()
doc = xml.dom.minidom.parseString(data)
return lines, doc
- Three functions left to write
readGlossaryFile builds a dictionary whose keys are the terms defined in the glossary- Values are
None for now—see why in a moment def readGlossaryFile(filename):
if filename is None:
return None
infile = open(filename, 'r')
doc = xml.dom.minidom.parse(infile)
terms = doc.getElementsByTagName('glossitem')
result = {}
for term in terms:
t = str(term.getAttribute('id'))
result[t] = None
return result
checkGlossary processes uses of glossary terms in a single lecture file- Records the name of the file in which the term appears in the
glossary dictionary - If some other filename is already there, reports duplicate definition
def checkGlossary(filename, doc, glossary):
defns = doc.getElementsByTagName('d')
for defn in defns:
d = str(defn.getAttribute('ref'))
if d not in glossary:
print 'term %s in %s missing from glossary' % (d, filename)
elif glossary[d] is not None:
print 'term %s defined in %s and %s' % (d, filename, glossary[d])
else:
glossary[d] = filename
checkGlossaryComplete looks for glossary entries without associated filenames- I.e., terms that are in the glossary, but aren't highlighted anywhere in the lectures
def checkGlossaryComplete(glossary):
unused = []
for g in glossary:
if glossary[g] is None:
unused.append(g)
if unused:
unused.sort()
print 'unused terms'
for u in unused:
print '\t%s' % u
- Comment on this slide
Checking Cross-References
- Last task (for now): check external files
- Every code fragment and image that's referenced must exist
- Every code file and image must be referenced
- Obvious opportunities for abstraction
- Get the set of files in a directory
- Get the set of files referenced in the lecture file
- Report differences in both directions
- Follow the pattern used for checking the glossary
-I dir specifies the root directory for images- If none provided, don't check images
-C dir specifies the root directory for code fragments- Use the value of the lecture's
id attribute to determine which particular subdirectory to search- No point having a “don't bother to check completeness” option, since each lecture's files are stored separately
- Add options to
getopt.getopt string, and four more lines to the main processing loop if codeRootDir:
checkFiles(filename, doc, codeRootDir, 'code', 'src')
if imageRootDir:
checkFiles(filename, doc, imageRootDir, 'img', 'src')
- Write
checkFiles
- Construct the path to the directory
- Find out what it contains
- Subtract things like the
.svn directory - Note how we've left room for other things to be excluded?
- Find out what's used in the document
- Using a
Set automatically handles multiple references to a single source file
- Rely on set subtraction to find differences
def checkFiles(filename, doc, rootDir, eltName, attrName):
# What should we ignore?
Excludes = ['.svn']
# Find out where we're supposed to look.
docId = str(doc.documentElement.getAttribute('id'))
dir = os.path.join(rootDir, docId)
if not os.path.isdir(dir):
print >> sys.stderr, 'Missing directory: %s' % dir
return
# Find out what's there that we care about.
actual = Set(os.listdir(dir))
for e in Excludes:
actual.discard(e)
# Find what's used in the document.
elts = doc.getElementsByTagName(eltName)
referenced = Set()
for e in elts:
if e.hasAttribute(attrName):
referenced.add(str(e.getAttribute(attrName)))
# Show differences (if any).
showDiff(filename, dir, 'not found', referenced - actual)
showDiff(filename, dir, 'unused', actual - referenced)
- Then write the helper function
showDiff
def showDiff(filename, dir, title, values):
if len(values):
print '%s (for file %s and directory %s):' % (title, filename, dir)
for v in values:
print '\t%s' % v
- Comment on this slide
Summary
- It took an hour to write and debug this code
- Started with something simple
- Tested everything as I wrote it
- Found and fixed 31 errors in the lecture notes
- Now type
make clean and make validate before doing a commit- Ensures that what goes into the repository has at least some chance of being sensible
- Ensures that what other people add to the course will conform to style and usage rules
- Course materials include several other tools
- Re-run sample programs and check that the output stored in the lectures is still correct
- Translate lecture notes, glossary, and bibliography into HTML
- Extract
summary values from each <topic/> element to create HTML version of course syllabus - Etc.
-
[Clark 2004]
discusses other things you can do to automate routine project maintenance tasks
- Prevents your project materials from rusting
- Which makes those materials easier to share
- And gives you higher confidence that they're working correctly
- Gives you more time to concentrate on things that actually require human attention
- Comment on this slide
Exercises
Exercise 19.1:
What does getopt do when it encounters an argument it
doesn't recognize? Write a short program that demonstrates this
behavior, that can be run on its own without the user passing in
any command-line arguments.
Binary Data
Isn't It All 'Binary'?
- All data is stored as 1's and 0's
- But those 1's and 0's may represent:
- Characters that can be displayed as text
- Something else
- That “something else” is (misleadingly) called binary data
- Usually means “anything you can't manipulate it with a standard text editor”
- Why use it?
- Size
"10239472" is 8 bytes long- The 32-bit integer it represents is 4 bytes
"False" is 5 bytes, 0 is one bit (40:1)
- Speed
- Takes dozens of operations to add the integer represented by
"34" to the one represented by "57"
- Hardware interfaces
- Somewhere, something has to convert the electrical signal from the gas chromatograph to a readable number
- Lack of anything better
- It's possible to represent images as text (ASCII art, PostScript)
- But sound? Or movies?
- A lot of scientific data is stored in binary formats
- Important to know how to work with it
- But please, don't write code to do this unless you absolutely have to
- Good libraries exist for working with every image, sound, and video format out there
- Comment on this slide
How Numbers Are Stored
- Positive numbers stored in base-2 format
- 10012 is (1×23)+(0×22)+(0×21)+(1×20) = 9
- Could use sign-and-value for negative numbers
- 00112 would be 310, 10112 would be -310
- But then there'd be two representations of zero (0000 and 1000)
- More efficient to use two's complement
- “Roll over” when going below zero, like a car's odometer
- 11112 is -110, 11102 is -210, etc.
- 10002 is the most negative 4-bit integer, 01112 the most positive
- Asymmetric: there is one more negative number than positive
- Since there has to be room for 0 in the middle
- Comment on this slide
Bitwise Operators
- Like most languages, Python has four operators that work on bits
| Name | Symbol | Purpose | Example |
|---|
| And | & | 1 if both bits are 1, 0 otherwise | 0110 & 1010 = 0010 |
| Or | | | 1 if either bit is 1 | 0110 & 1010 = 1110 |
| Xor | ^ | 1 if the bits are different, 0 if they're the same | 0110 & 1010 = 1100 |
| Not | ~ | Flip each bit | ~0110 = 1001 |
Table 20.1: Bitwise Operators in Python
- Note: the name “xor” is short for “exclusive or”, i.e., either/or
- Common to store several independent yes-or-no flags as a set of bits
- Example: need to record whether a sample contains any mercury, phosphorus, or chlorine
- Use bit 1 for mercury, bit 2 for phosphorus, bit 3 for chlorine
![[Using Bits to Record Sets of Flags]](img/binary/bit_flags.png)
Figure 20.1: Using Bits to Record Sets of Flags |
- Define constants to test for particular elements
# hex binary
MERCURY = 0x01 # 0001
PHOSPHORUS = 0x02 # 0010
CHLORINE = 0x04 # 0100
# Sample contains mercury and chlorine
sample = MERCURY | CHLORINE
print 'sample: %04x' % sample
# Check for various elements
for (flag, name) in [[MERCURY, "mercury"],
[PHOSPHORUS, "phosphorus"],
[CHLORINE, "chlorine"]]:
if sample & flag:
print 'sample contains', name
else:
print 'sample does not contain', name
sample: 0005
sample contains mercury
sample does not contain phosphorus
sample contains chlorine
- Can use and, or, and not to set specific bits to 1 or 0
- To set the ith of
x to 1:
- Create a value
mask in which bit i is 1 and all others are 0 - Use
x = x | mask
- To set the ith of
x to 0:
- Create a value
mask in which bit i is 1 and all others are 0 - Negate it using
~, so that the ith bit is 0, and all the others are 1 - Use
x = x & mask
- Comment on this slide
Shifting
- Shifting an integer's bits left N places written as
x << N
- Each leftward shift corresponds to multiplying by 2
- Just as shifting a decimal number left corresponds to multiplying by 10
- Example: 6<<2 is 01102<<2, or 11002, which is 12
- Shifting a number right corresponds to division by 2 (throwing away the remainder)
- 710>>1 is 01112>>1, or 00112, which is 310
- Shifting is not more efficient than multiplication and division on modern computers
- What happens if the top bit changes value as a result of a shift?
- 610<<1 = 01102<<1 = 11002
- On a 4-bit machine, this is -410, not 1210
- Some machines preserve the sign bit when shifting down
- So 11002>>1 = 11102, instead of 01102
- Depends on the hardware being used
- Java provides a separate operator for this
- Comment on this slide
Floating Point
- Floating point numbers are more complicated
- A 32-bit float has:
- One bit for the sign
- 23 bits for the mantissa (or value)
- 8 bits for the exponent
![[Floating Point Representation]](img/binary/float_rep.png)
Figure 20.2: Floating Point Representation |
- This means that you can only represent a fixed set of values
- If the actual value isn't in that set, you must settle for the closest available approximation
- Consequence #1: values are unevenly spaced
- Imagine we only had 6 bits for each floating-point number (1 sign, 3 mantissa, 2 exponent)
- Means less absolute precision for numbers with larger magnitudes
![[Uneven Spacing of Floating-Point Numbers]](img/binary/uneven_spacing.png)
Figure 20.3: Uneven Spacing of Floating-Point Numbers |
- Consequence #2: roundoff errors
- Our system can represent 6, and it can represent ¼, but not 5¾
- So 6 - 0.25 is 6, not 5.75
- And if 6 - 0.25 - 0.25 - 0.25 - 0.25 is evaluated left to right, the answer is still 6
- This is not random
- Happens exactly the same way every time
- But it is very hard to reason about
- Which is why people get Ph.D.'s in numerical analysis
- Comment on this slide
Binary I/O
- Read and write binary data using standard file methods
f.read(N) reads (up to) next N bytes- Result is returned as a string, but there's no guarantee its contents are readable
- If the file is empty, returns
None
f.write(str) writes the bytes in the string str
- Caution: must open files in binary mode on Windows
input = open(filename, 'rb') (and similarly for output)- Otherwise, the low-level routines Python relies on convert Windows line endings
"\r\n" to Unix-style "\n"… - …which is an unkind thing to do to an image
- Example: open a file using
"r", then in "rb"
- Identical on Unix, but different on Windows
import sys
print sys.platform
for mode in ('r', 'rb'):
f = open('open_binary.py', mode)
s = f.read(40)
f.close()
print repr(s)
linux2
'import sys\nprint sys.platform\nfor mode i'
'import sys\nprint sys.platform\nfor mode i'
- Comment on this slide
Packing and Unpacking
- In C and Fortran, an integer is a raw 32-bit value
fwrite(&array, sizeof(int), 3, file) will write 3 4-byte integers from address array to file
- Python, Java, and other languages usually don't use raw values
- There's no guarantee that things like lists are stored contiguously in memory…
- …so programs need to pack data into contiguous bytes for writing…
- …and unpack those bytes to recreate the structures when reading
- Packing looks a lot like formatting a string
- A format specifies the data types being packed (including sizes, where appropriate)
- This format exactly determines how much memory is required by the packed representation
- The result of packing is a chunk of bytes
- Stored as a string in Python
- But as mentioned above, it's not a string of characters
![[Packing Data]](img/binary/pack_data.png)
Figure 20.4: Packing Data |
- Unpacking reverses this process
- Read bytes from a “string” according to a format
- Use the data in these bytes to create Python data structures
- Return the result as a tuple of values
![[Unpacking Data]](img/binary/unpack_data.png)
Figure 20.5: Unpacking Data |
- Comment on this slide
The struct Module
- Use Python's
struct module to pack and unpackpack(fmt, v1, v2, …) packs the values v1, v2, etc. according to fmt, returning a stringunpack(fmt, str) unpacks the values in str according to fmt, returning a tuplecalcsize(fmt) calculates how large (in bytes) the data produced using fmt will be- Packing and unpacking are always fixed-size operations
- Format specifiers
| Format | Meaning |
|---|
"c" | Single character (i.e., string of length 1) |
"B" | Unsigned 8-bit integer |
"h" | Short (16-bit) integer |
"i" | 32-bit integer |
"f" | 32-bit float |
"d" | Double-precision (64-bit) float |
"2" | String of fixed size (see below) |
Table 20.2: Format Specifiers in Python
- Any format can be preceded by a count
- E.g.,
"4i" is four integers
- Must always specify the size of strings
- E.g.,
"4s" for a 4-character string - Otherwise, how would
unpack know how much data to use?
- How much data is packed is specified by the format
- Can pack the lowest 8 or 16 bits of an integer using
"B" or "h" instead of the full 32
- Example: pack two integers, using only 16 bits for each
import struct
fmt = 'hh' # two 16-bit integers
x = 31
y = 65
binary = struct.pack(fmt, x, y)
print 'binary representation:', repr(binary)
normal = struct.unpack(fmt, binary)
print 'back to normal:', normal
binary representation: '\x1f\x00A\x00'
back to normal: (31, 65)
- Uh, what's
'\x1f\x00A\x00'? - If Python finds a character in a string that doesn't have a printable representation, it prints a 2-digit hexadecimal (base-16) escape sequence
- Uses the letters A-F (or a-f) to represent the digits from 10 to 15
- So this string represents the four bytes
['\x1f', '\x00', 'A', '\x00']
- 1f16 is (1×16 + 15), or 31
- ASCII code for the letter
"A" is 6510
- Note that the least significant byte of the integer comes first
- This is called little-endian, and is used by all Intel processors
- Other chips put the most significant byte first (big-endian)
- If you move data from one architecture to another, it's your responsibility to flip the bytes…
- …because the machine doesn't know what the bytes mean
import struct
packed = struct.pack('4c', 'a', 'b', 'c', 'd')
print 'packed string:', repr(packed)
left16, right16 = struct.unpack('hh', packed)
print 'as two 16-bit integers:', left16, right16
all32 = struct.unpack('i', packed)
print 'as a single 32-bit integer', all32[0]
float32 = struct.unpack('f', packed)
print 'as a 32-bit float', float32[0]
packed string: 'abcd'
as two 16-bit integers: 25185 25699
as a single 32-bit integer 1684234849
as a 32-bit float 1.67779994081e+22
- Can be very hard to track down errors in packing and unpacking
- Unpacking the wrong type, or too many/too few values, produces garbage
- Comment on this slide
Packing Variable-Length Data
- How to store a variable-length vector of integers?
- Store the number of elements in a fixed-size header
- Then store that many integers one by one
![[Packing a Variable-Length Vector]](img/binary/packed_vec.png)
Figure 20.6: Packing a Variable-Length Vector |
- Packing is easy:
def packVec(vec):
buf = struct.pack('i', len(vec))
for v in vec:
buf += struct.pack('i', v)
return buf
- Unpacking is a little harder
- Use
calcsize('i') instead of hard-coding the number 4- 64-bit machines are on their way…
- Each pass through the unpacking loop, have to step up to the right location in the packed string
def unpackVec(buf):
# Get the count of the number of elements in the vector.
intSize = struct.calcsize('i')
count = struct.unpack('i', buf[0:intSize])[0]
# Get 'count' values, one by one.
pos = intSize
result = []
for i in range(count):
v = struct.unpack('i', buf[pos:pos+intSize])
result.append(v[0])
pos += intSize
return result
- And then we test:
if __name__ == '__main__':
tests = [
[],
[1],
[1, 2],
[-3, 456, 7890]
]
for t in tests:
s = packVec(t)
v = unpackVec(s)
assert v == t
- Problem: what if you want to pack strings, but don't know their length in advance?
- Solution: create the format string on the fly, and save the string's length as well as its characters
import struct
def packStrings(strings):
result = ''
for s in strings:
length = len(s)
format = 'i%ds' % length
result += struct.pack(format, length, s)
return result
def unpackStrings(buf):
intSize = struct.calcsize('i')
pos = 0
result = []
while pos < len(buf):
length = struct.unpack('i', buf[pos:pos+intSize])[0]
pos += intSize
format = '%ds' % length
s = struct.unpack(format, buf[pos:pos+length])[0]
pos += length
result.append(s)
return result
if __name__ == '__main__':
tests = ['', 'a', 'bcd']
for i in range(len(tests)):
t = tests[0:i+1]
buffer = packStrings(t)
result = unpackStrings(buffer)
assert result == t
- Comment on this slide
Metadata
- Literally, “data about data”
- I.e., data that describes other data, such as the date it was collected, or its format
- When creating binary files, put a header at the start of the file that describes the format of the data the file contains
- Advantages:
- One parser handles all data files
- Can't lose the format: programs come and go, but data is forever
- Disadvantages:
- Slower (generality always is)
- Reader is more complicated than a single special-purpose reader would be…
- …but simpler than the sum of all the special-purpose readers you'd have to write…
- …and you only have to debug it once
- Files have a three-part structure:
- Integer (fixed size) recording the length of the metadata
- Metadata (N bytes) describing the format of the records in the file
- The records themselves
![[Structure of a Binary File With Metadata]](img/binary/metadata.png)
Figure 20.7: Structure of a Binary File With Metadata |
- Step 1: store a list of identically-structured records to a file
def store(outstream, format, data):
'''Store records in a file. Each record is either a list or a
tuple; all records must have exactly the same format.'''
# Length of metadata.
s = struct.pack("i", len(format))
outstream.write(s)
# Metadata.
outstream.write(format)
# Records.
for d in data:
d = [format] + list(d) # Build pack's argument list
s = struct.pack(*d) # Apply pack to those arguments
outstream.write(s)
- Notice how
struct.pack is called- It takes each value to be packed as a separate argument, rather than taking a list of values
- First argument has to be the format
- So create a list with the format, and the values to be packed, and apply
struct.pack to it - Common pattern when using variable number of arguments
- Step 2: unpack the bytes created by
store
- Read the size of the metadata, then the metadata, then the data
def retrieve(instream):
'''Retrieve records from a file whose format is:
'i' : length of metadata
metadata : record format descriptor
records : actual data'''
# Length of metadata.
numBytes = struct.calcsize("i")
s = instream.read(numBytes)
formatLen = struct.unpack("i", s)[0]
# Metadata.
format = instream.read(formatLen)
recordLen = struct.calcsize(format)
# Records.
data = []
s = instream.read(recordLen)
while s:
assert len(s) == recordLen
d = list(struct.unpack(format, s))
data.append(d)
s = instream.read(recordLen)
# Done
return data
- Notice the
assert in the reading loop
- Step 3: test that it works
- Just as important as steps 1 and 2
if __name__ == '__main__':
from cStringIO import StringIO
tests = [
['i', [[1]]],
['i', [[1], [2]]],
['3s', [['xyz']]],
['3s', [['abc'], ['def']]],
['f3s', [[1.0, 'pqr']]],
['f3s', [[2.0, 'stu'], [3.0, 'vwx']]]
]
for (format, data) in tests:
outstream = StringIO()
store(outstream, format, data)
instream = StringIO(outstream.getvalue())
dup = retrieve(instream)
assert data == dup
- Comment on this slide
Relational Databases
It's All Tables
- Flat text and XML both have their place, but most of the data people really care about is stored in relational databases
- Originated with E. F. Codd's work in the late 1960s and early 1970s
- By the 1980s, Oracle and IBM's DB2 dominated the market
- Open source alternatives like
MySQL and PostgreSQL emerged in the 1990s SQLite offers a lightweight alternative for small jobs- Originally designed for use in small web sites
- Stores entire database in a single file on disk (which simplifies backup and recovery)
- The bad news: it's a huge topic
- Oracle's documentation would fill this room
- No matter what room you're in
- The good news: you only need to know a little to do most of the things you need to
- A few key ideas
- A little syntax
-
[Fehily 2003]
is a good tutorial and reference guide
- When to use a relational database:
- Lots of data
- Mostly facts, rather than objects
- I.e., not images, video clips, etc.
- Although a database may be the best place to store information about these
- Complex queries
- “Find all experiments done with the Mark VII that had yields greater than 30%, that didn't use cadmium disulfide as a reagant”
- Relational database can answer these questions directly
- Comment on this slide
Getting Started
- A database is a collection of zero or more tables, each of which has a name, and represents a relation
- Basically, each table stores one kind of fact
- Each table has a fixed set of named columns
- Like the fields in a form, or the members of an object
- All the values in a column have the same type
- Like XML attributes, values in a column shouldn't have any internal structure
- I.e., you shouldn't have to parse them in order to use them
- Each table has zero or more rows, each of which represents a separate record
- One or more values in each record form its key
- Unlike a dictionary, database doesn't guarantee uniqueness of keys: you have to
- Example: experimental data
![[Database Tables]](img/sql/database_tables.png)
Figure 21.1: Database Tables |
- We'll use these tables as a running example
- A foreign key is a value (or set of values) in one table that identifies a record in another
- For example, the values in the
Login column in Involved identify records in the Person table
- The example database illustrates an important rule of database design: no redundant information
- Natural to combine
Experiment and Involved tables into one ![[A Combined Table]](img/sql/combined_table.png)
Figure 21.2: A Combined Table |
- But now some information appears in two or more places…
- …which makes updates, consistency checks, and queries all harder to do
- Frequently have to make up attributes (like
InvolvedId) to relate these tables
- Comment on this slide
Basic SQL
- Interact with database using a specialized language called SQL (Structured Query Language)
- Several versions have been standardized…
- …and every database adds its own quirks
- To create a table, specify its name, and the names and types of its columns
- Should also specify whether values in that column can be null (i.e., undefined)
CREATE TABLE Person(
Login TEXT NOT NULL,
LastName TEXT NOT NULL,
FirstName TEXT NOT NULL
);
- To insert values into a table, specify the name of the tables, and the values to be inserted
- Each
INSERT creates a new row INSERT INTO Person VALUES("skol", "Kovalevskaya", "Sofia");
INSERT INTO Person VALUES("mlom", "Lomonosov", "Mikhail");
INSERT INTO Person VALUES("dmitri", "Mendeleev", "Dmitri");
INSERT INTO Person VALUES("ivan", "Pavlov", "Ivan");
- To erase a table, use
DROP TABLE name
- Remember, back up early, back up often…
- Usual way to run queries like this is to put them in a file, and give it to the database manager
- E.g.,
sqlite < make_running_example.sql
- Most database managers also provide an interactive prompt, GUI tools, etc.
- Comment on this slide
Simple Queries
- Suppose we want to get the login IDs of everyone in the system, along with their names
- Write this query almost exactly as we just said it
SELECT Person.FirstName, Person.LastName, Person.Login FROM Person;
Sofia|Kovalevskaya|skol
Mikhail|Lomonosov|mlom
Dmitri|Mendeleev|dmitri
Ivan|Pavlov|ivan
- Exact output depends on your database manager
- Usually run it through a text filter, or view it in a GUI
- How about sorting rows by login ID instead of by the order they were added to the database?
SELECT Person.FirstName, Person.LastName, Person.Login
FROM Person
ORDER BY Person.Login ASC;
Dmitri|Mendeleev|dmitri
Ivan|Pavlov|ivan
Mikhail|Lomonosov|mlom
Sofia|Kovalevskaya|skol
- Note: SQL doesn't stick to single-word keywords (e.g.,
ORDER BY)
- Comment on this slide
Joins
- A join is a query that combines information from two (or more) tables
- Construct the cross product of the tables by combining each row of the first with each row of the second
- Discard rows that don't meet some criteria
- Select columns from the surviving rows
- Example: find out who has done experiments for each project
SELECT Project.ProjectName, Involved.Login
FROM Project, Involved
WHERE Project.ProjectId = Involved.ProjectId;
Antigravity|mlom
Antigravity|mlom
Teleportation|dmitri
Teleportation|skol
Teleportation|ivan
Teleportation|mlom
Time Travel|skol
Time Travel|skol
Time Travel|ivan
- Construct cross product of
Person and Involved (which has 4×9=36 rows) - Throw away those for which the project IDs taken from the two tables don't match
- Select the columns of interest
![[Join In Action]](img/sql/join_process.png)
Figure 21.3: Join In Action |
- Note:
mlom appears twice for the Antigravity project because he did two experiments for it- Use the
DISTINCT keyword to eliminate duplicates SELECT DISTINCT Project.ProjectName, Involved.Login
FROM Project, Involved
WHERE Project.ProjectId = Involved.ProjectId;
Antigravity|mlom
Teleportation|dmitri
Teleportation|skol
Teleportation|ivan
Teleportation|mlom
Time Travel|skol
Time Travel|ivan
- How to find people who have done experiments for two (or more) projects?
- First attempt: use
AND
SELECT DISTINCT Person.Login
FROM Person, Involved
WHERE (ProjectId = 1214) AND (ProjectId = 1709);
- Doesn't work because
ProjectId cannot simultaneously be 1214 and 1709
- Second attempt: use
OR
SELECT DISTINCT Person.Login
FROM Person, Involved
WHERE (ProjectId = 1214) OR (ProjectId = 1709);
skol
mlom
dmitri
ivan
- Doesn't work because every project's ID is either 1214 and 1709
- Right solution that works is to join the
Involved table with itself, so that we have two project IDs in the same row- Then select rows where the person is the same, but the project IDs are different
- Remember, everything is done row-by-row
SELECT DISTINCT A.Login
FROM Involved A, Involved B
WHERE (A.Login = B.Login)
AND (A.ProjectId != B.ProjectId);
mlom
skol
ivan
- Comment on this slide
Negation and Nested Queries
- How to find everyone who hasn't been experimenting with time travel?
- Can't just select rows of
Involved where ProjectId is not 1737, since both Kovalevskaya and Pavlov have worked on other projects SELECT DISTINCT Involved.Login
FROM Involved
WHERE (Involved.ProjectId != 1737);
mlom
dmitri
skol
ivan
- Solution is counter-intuitive, but well-defined:
- Select the rows we don't want
- Subtract them from all possible rows
- Requires use of nested queries
- Database manager runs the inner query first, then applies the outer query to the inner query's result
NOT IN means exactly what it saysSELECT DISTINCT Login
FROM Involved
WHERE Login NOT IN
(SELECT Login
FROM Involved
WHERE Involved.ProjectId = 1737);
mlom
dmitri
- These are useful for many other things as well
- Example: how many people have done experiments for exactly one project?
- Solution: find the people who've done experiments for none, or for two or more, and subtract them from everyone
SELECT DISTINCT Login
FROM Involved
WHERE Login NOT IN
(SELECT A.Login
FROM Involved A, Involved B
WHERE (A.Login = B.Login)
AND (A.ProjectId != B.ProjectId));
dmitri
- Comment on this slide
Aggregation
- Often need to aggregate (combine) values from different rows
- Total sales, fastest trip, etc.
- SQL provides built-in functions
- Example: how much time has Mikhail spent on antigravity experiments?
SELECT SUM(Experiment.Hours)
FROM Involved, Experiment
WHERE (Involved.Login = "mlom")
AND (Involved.ProjectId = 1214)
AND (Involved.ProjectId = Experiment.ProjectId)
AND (Involved.ExperimentId = Experiment.ExperimentId);
15.8
- How much time has each scientist spent on experiments?
- Use
GROUP BY to apply aggregation function to specific subsets of rows SELECT Involved.Login, SUM(Experiment.Hours)
FROM Involved, Experiment
WHERE (Involved.ProjectId = Experiment.ProjectId)
AND (Involved.ExperimentId = Experiment.ExperimentId)
GROUP BY Involved.Login;
ivan|5.5
dmitri|7
skol|4.5
mlom|23
- Note: negative hours on time travel experiments really mess up budgeting…
- Comment on this slide
Using Other Languages
- Usually don't write entire application in SQL, or run SQL in sub-shell
- Instead, embed SQL in the programming language of your choice
- Visual Basic, Java, Perl, Python, …
- PHP was designed for exactly this purpose
- Key concepts:
- A connection is a way to get commands from the program to the database, and results back
- Typically a socket, but other methods are used as well
- A cursor is a pointer to a database operation
- Like a cursor in an editor
- Example: get the names of all the scientists into a Python program
import sqlite
connection = sqlite.connect("running.db")
cursor = connection.cursor()
cursor.execute("SELECT FirstName, LastName FROM Person;")
results = cursor.fetchall();
for r in results:
print r
cursor.close();
connection.close();
('Sofia', 'Kovalevskaya')
('Mikhail', 'Lomonosov')
('Dmitri', 'Mendeleev')
('Ivan', 'Pavlov')
- Need the right driver to connect to the database
- Everything else (on the Python side) is the same, regardless of the kind of database
- Although as mentioned before, the SQL might be different
- Comment on this slide
Handling Null Values
- Real-world data always has holes in it
- Some people don't have cell phone numbers, some authors' birth dates are unknown…
- Represent this in a database using the special value
NULL
- Not the same as zero, empty string, False, etc.
- Instead, means “nothing known at all”
- Check to see if a value is null using
IS NULL - The result of any computation involving
NULL is NULL
2 + NULL is NULL, NULL OR True is NULL, etc.
- Although in some databases,
False AND NULL is False, and True OR NULL is True
- By default, columns may contain
NULL, but this can be prohibited when the table is created- An example of an integrity constraint
CREATE TABLE Experiment(
ProjectId INTEGER NOT NULL,
ExperimentId INTEGER NOT NULL,
NumInvolved INTEGER NOT NULL,
ExperimentDate DATE,
Hours REAL NOT NULL
);
- Have to write queries to account for
NULL
Experiment.ExperimentDate <> 1901-05-01 selects all experiments that weren't conducted on May 1, 1901, and all experiments whose date is NULL (since NULL isn't equal to anything except itself)- Have to use
(Experiment.ExperimentDate <> 1901-05-01) AND (Experiment.ExperimentDate IS NOT NULL)
- Comment on this slide
Concurrency
- The biggest challenge in database programming isn't formulated queries—it's handling concurrency
- Two or more things happening at once
- In the database world, one user changing the database while another is making a query
- A related problem is the need to ensure atomic (or indivisible) operations
- Sequence of operations is “remove $100.00 from grant #19823”, then “add $100.00 to grant #17928”
- Don't want money to disappear if computer goes down in between
- Solution to both problems is to use a transaction
- A set of operations which all work as if nothing else was going on, or have no effect on the database
- ACID:
- Atomic: either all are performed, or none
- Consistent: database is in a legal state when the transaction ends
- Isolated: no operation outside the transaction sees the database in any intermediate state
- Durable: once the user is notified that the operation has completed, its effects are permanent
- Transactions sometimes used for queries, but should always be used for updates
- Example: change Kovalevskaya's login ID from
"skol" to "kovalev" BEGIN TRANSACTION;
UPDATE Person
SET Login = "kovalev"
WHERE Login = "skol";
UPDATE Involved
SET Login = "kovalev"
WHERE Login = "skol";
END TRANSACTION;
SELECT *
FROM Person
WHERE (Login = "kovalev") OR (Login = "skol");
SELECT *
FROM Involved
WHERE (Login = "kovalev") OR (Login = "skol");
kovalev|Kovalevskaya|Sofia
1709|1|2|kovalev
1737|1|1|kovalev
1737|2|1|kovalev
- Wrap the changes in a transaction so that:
- No query can run after
Person changes, but before Involved changes - If database goes down in the middle of the operation, database doesn't become inconsistent
- So why not use transactions everywhere?
- Because they require the database to serialize some operations
- Which slows the system down
- Comment on this slide
Client-Side Web Programming
Small Pieces, Loosely Joined
- The Unix command line was the world's first component object model
- Allowed programmers to build small pieces, then connect them in arbitrary ways
- Key features:
- Low barriers to entry: it's easy to add one more tool to the toolbox
- Common data format: stream of strings
- Common communication protocol:
stdin, stdout, and zero/nonzero exit codes
- The Web succeeded (in part) because it followed the same model
- Data format: HTML (now XML)
- Communication protocol: HTTP
- This lecture looks at how to build programs that get data from the web
- Next lecture will discuss simple ways to build programs that supply data
- If you want to know more,
[Goerzen 2004]
is a comprehensive guide
- Comment on this slide
Distributed Is Different
- Distributed systems are fundamentally different from unitary systems
- Small program can ignore these differences…
- …but every industrial-strength application eventually has to deal with them
- Difference #1: concurrency
- As in databases, means “several things happening at once”
- Can lead to:
- Deadlock: A is waiting for B while B is waiting for A
- Race conditions: final result depends on whether A or B goes last
- Difference #2: partial failure
- One component fails while others are still healthy
- If you've waited five seconds for a web site to respond, should you assume that it's down, or keep waiting?
- There's a whole industry devoted to dealing with this
- Both differences make distributed applications much harder to debug than unitary ones
- Often have heisenbugs (which only appear intermittently)
- And it's usually impossible to get a complete picture of the system's state
- Solution: get it right the first time
- For now, we will ignore both problems
- Write small programs that use the web, but aren't used by others
- Assume that everything else is reliable
- Comment on this slide
Under the Hood
- These days, the Internet runs on a family of communication standards called Internet Protocol (IP)
- Defines everything from electrical signal strengths on up
- Using IP, process communicate through sockets
- Each socket is one end of a point-to-point communication channel
- Provides the same kind of read and write operations as files
- Every socket is uniquely identified by host address and a port number
- A host address is four 8-bit numbers, like 24.153.22.195
- Most addresses have symbolic names, like
www.third-bit.com
- A port is just a number in the range 0-65535
- 0-1023 are reserved for the operating system
- Most commonly used member of the IP family is Transmission Control Protocol (TCP)
- Guarantees that any data sent is received, in order
- Lots of applications use TCP directly
- Email, instant messaging, FTP, games…
- But don't worry: we'll be programming at a much higher level
- Comment on this slide
The Hypertext Transfer Protocol
- HTTP (the HyperText Transfer Protocol) specifies how programs exchange documents over the web
- Originally designed to support file download
- Has grown in many ways
- Three stages in a typical transaction:
![[HTTP Request Cycle]](img/client/http_cycle.png)
Figure 22.1: HTTP Request Cycle |
- A client program opens a connection to a web server
- The client sends a request containing
- The thing the client wants (such as the name of a web page)
- Headers that define the type of transaction
- Anything else the server might need (e.g., a file for upload)
- The server replies with:
- Headers that define the status (success or failure) and the type of data being sent
- Data (such as the actual web page)
- This cycle may be repeated many times to display one web page
- Each image is requested and downloaded separately
- Comment on this slide
HTTP Request
- An HTTP request has three parts
![[HTTP Request Format]](img/client/http_request.png)
Figure 22.2: HTTP Request Format |
- HTTP request line
- HTTP method is usually either
GET or POST
GET typically used to fetch informationPOST usually used to submit form data, or to upload files
- URL is the thing the request wants
- Typically a path to a file, such as
/index.html - But other information (discussed below) can be put there as well
- Entirely up to the server how to interpret the URL
- HTTP version
- Headers
- A set of name/value pairs
- A given name may appear many times, i.e., may be associated with many values
- Examples:
Accept: text/htmlAccept-Language: en, frIf-Modified-Since: 16-May-2005
- Body is any extra data associated with the request
- Used with web forms, to upload files, etc.
- Note: there's no magic
- An HTTP request is just text—any program that wants to can create them or parse them
- Comment on this slide
HTTP Response
- Just as simple as a request
![[HTTP Response Format]](img/client/http_response.png)
Figure 22.3: HTTP Response Format |
- HTTP response line
- Once again, version is typically
HTTP/1.0 - Status code is a number indicating what happened
- 200: everything worked
- 404: page not found
- Status phrase repeats that information in a human-readable phrase (like “OK” or “not found”)
- Headers are once again name/value pairs
- Then there must be a blank line to signal the end of headers
- Common error to forget this
- Body is whatever data the server wants to send back
- A web page, an image, etc.
- Comment on this slide
Example
- Fetch a page from the course site
import sys, socket
HttpRequest = \
'''GET /greeting.html HTTP/1.0
'''
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("www.third-bit.com", 80))
s.send(HttpRequest)
response = ''
while True:
data = s.recv(1024)
if not data:
break
response += data
s.close()
print response
HTTP/1.1 404 Not Found
Date: Sat, 02 Jul 2005 11:16:44 GMT
Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 PHP/4.3.10-9 mod_ssl/2.0.54 OpenSSL/0.9.7e
Content-Length: 394
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /greeting.html was not found on this server.</p>
<hr>
<address>Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 PHP/4.3.10-9 mod_ssl/2.0.54 OpenSSL/0.9.7e Server at pyre.third-bit.com Port 80</address>
</body></html>
- Request has no headers
- So the blank line that signals …end of headers” is right after the request line
- Create a socket
socket.AF_INET means “an Internet socket”socket.SOCK_STREAM means “use TCP”
- Connect to the server
- Port 80 is the standard one for web servers
- Send the request, and read response data until there's nothing more to read
- Comment on this slide
Fetching Pages
- Opening sockets, constructing HTTP requests, and parsing responses is tedious
- So most languages provide libraries to do the work for you
- In Python, that library is called
urllib
urllib.urlopen(URL) does what your browser would do if you gave it the URL- Parse it to figure out what server to connect to
- Connect to that server
- Send an HTTP request
- Give the calling program an object that looks like a file, from which to read response data
- Example: read the test page the easy way
import urllib
instream = urllib.urlopen("http://www.third-bit.com/greeting.html")
lines = instream.readlines()
instream.close()
for line in lines[:5]:
print line,
- Note:
readlines wouldn't do the right thing if the thing being read was an image- Use
read to grab the bytes in that case - Up to the client application to do the right thing
- Example: make a list of the links in a web page
- The first step in build a web spider that can explore the web on its own
- That, plus a search engine, and you're Google…
- Fetch the page
- Parse it to extract all the links
- Should use XML, but regular expressions are good enough to give you the idea
import urllib, re
from sets import Set
instream = urllib.urlopen("http://www.third-bit.com/index.html")
page = instream.read()
instream.close()
links = re.findall(r'href=\"[^\"]+\"', page)
temp = Set()
for x in links:
temp.add(x[6:-1])
links = list(temp)
links.sort()
for x in links:
print x
#ack
#news
#rules
#search
/MT/mt.cgi
/blog
/cgi-bin/awstats.pl?config=www.third-bit.com
http://books.slashdot.org/books/05/06/16/1930227.shtml
http://projects.edgewall.com/trac
http://subversion.tigris.org
http://weblog.infoworld.com/udell
http://weblog.infoworld.com/udell/2005/06/23.html
http://www.amazon.com/exec/obidos/redirect?path=ASIN/0974514071&link_code=as2&camp=1789&tag=thirdbit-20&creative=9325
http://www.amazon.com/exec/obidos/redirect?tag=thirdbit-20&creative=9325&camp=1789&link_code=as2&path=ASIN/0974514071
http://www.apache.org
http://www.artima.com
http://www.artima.com/chapters/book.jsp?num=114451
http://www.cs.utoronto.ca
http://www.debian.org
http://www.jonahgroup.com
http://www.list.org
http://www.oreillynet.com
http://www.oreillynet.com/cs/user/print/a/5862
http://www.pragmaticprogrammer.com/titles/gwd/index.html
http://www.provost.utoronto.ca
http://www.python.org
http://www.python.org/psf
http://www.sf.net
http://www.techbookreport.com/tbr0172.html
http://www.third-bit.com/pyweb/index.html
http://www.third-bit.com/swc
http://www.third-bit.com/trac/argon
http://www.third-bit.com/trac/pyre
http://www.third-bit.com/trac/pyre/newticket
http://www.third-bit.com/trac/swc
http://www.third-bit.com/~gvwilson
http://www.utoronto.ca
mailinglists.html
mailto:gvwilson@cs.utoronto.ca
old-presentations.html
past.html
reading.html
style.css
- Comment on this slide
Passing Parameters
- Sometimes want to provide extra information as part of a URL
- Example: when searching on Google, have to specify what the search terms are
- Could do this as part of the URL
- Amazon puts ISBNs in URLs
- More flexible to add parameters to the URL
http://www.google.com?q=Python searches for pages related to Python"?" separates the parameters from the rest of the URL- If there are multiple parameters, they are separated from each other by
"&"
http://www.google.ca/search?q=Python&client=firefox
- Each parameter takes the form
"name=value" - Spaces are replaced by
"+" (so “web programming” becomes “web+programming”)
- We'll see what to do about actual
"+"'s in a moment
- Comment on this slide
Special Characters
- Question: what if you want to include
"?" or "&" in a parameter? - Answer: you URL encode it using
"%" followed by its 2-digit hexadecimal code| Character | Encoding |
|---|
"#" | %23 |
"$" | %24 |
"%" | %25 |
"&" | %26 |
"+" | %2B |
"," | %2C |
"/" | %2F |
":" | %3A |
";" | %3B |
"=" | %3D |
"?" | %3F |
"@" | %40 |
Table 22.1: URL Encoding
- Example: to search Google for “grade = A+”, use
http://www.google.ca/search?q=grade+%3D+A%2B - Helper functions:
urllib.quote(str) replaces special characters in str with escape sequencesurllib.unquote(str) replaces escape sequences with charactersurllib.urlencode(params) takes a dictionary, or a list of pairs, and constructs the entire query parameter stringimport urllib
print urllib.urlencode({'surname' : 'hopper', 'forename' : 'grace'})
print urllib.urlencode([('surname', 'turing'), ('forename', 'alan')])
surname=hopper&forename=grace
surname=turing&forename=alan
- Comment on this slide
Web Services
- Suppose you want to write a script that actually does search Google
- Construct a URL: easy
- Send it and read the response: no problem
- Parse the response: hmm, there's a lot of junk on the page…
- Many first-generation web applications relied on screen scraping
- Problem is that whenever the site changes its layout, your application breaks
- Modern web services are designed more intelligently
- When a client sends a request, it indicates that it wants machine-readable XML, rather than human-readable HTML
- Much easier to parse
- Much less likely to change over time
- An example of a remote procedure call
![[Remote Procedure Call Using HTTP]](img/client/rpc.png)
Figure 22.4: Remote Procedure Call Using HTTP |
- Actually, all HTTP requests are remote procedure calls
- Implicitly invoking a function called “fetch me this file”
- Most web services use the Simple Object Access Protocol (SOAP)
- Despite its name, it's anything but simple
- Luckily, there are libraries to hide the details for most widely-used web services
- Example: fetch information about a book from Amazon
- Amazon has defined an API for web services
- You need to get a license key in order to use it
- They're free
- Allows Amazon to throttle requests to one per second per client
- Use functions in
amazon.py to search by various criteria - Result is a list of objects with information about items that matched the search criteria
import sys, amazon
# Format multiple authors' names nicely.
def prettyName(arg):
if type(arg) in (list, tuple):
arg = ', '.join(arg[:-1]) + ' and ' + arg[-1]
return arg
if __name__ == '__main__':
# Get information.
key, asin = sys.argv[1], sys.argv[2]
amazon.setLicense(key)
items = amazon.searchByASIN(asin)
# Handle errors.
if not items:
print 'Nothing found for', asin
if len(items) > 1:
print len(items), 'items found for', asin
# Display information.
item = items[0]
productName = item.ProductName
ourPrice = item.OurPrice
authors = prettyName(item.Authors.Author)
print '%s: %s (%s)' % (authors, productName, ourPrice)
- Note: much more code devoted to creating human-readable output than to getting the information
- Comment on this slide
Summary
- Most computers now spend more time communicating than they do calculating
- Every few years, we put another layer on top of the pile of protocols to make communication easier
- Getting information from the web is now (almost) as easy as getting it from a file
- See in the next lecture how to provide information to others
- Comment on this slide
CGI
The Active Web
- The web is far more than just a fancy replacement for inter-library loan
- Most of its power comes from the fact that browsers can interact with programs
- More accurately, browsers can ask web servers to interact with programs on their behalf
- This lecture looks at what to do if you receive an HTTP request
- Very important that you also attend the next lecture on security before putting programs of your own on the web
- Comment on this slide
The Common Gateway Interface
- How to let everyday users write programs that handle HTTP requests?
- Option #1: Require them to write socket-level code
- Complicated and error-prone
- Can only have one program listening to a socket at a time
- Option #2: have the web server accept the HTTP request, and then invoke the user's code
- Recompiling the web server every time someone wants to add functionality would be a pain
- So define a protocol that lets web servers invoke arbitrary external programs
- The Common Gateway Interface (CGI) is a simple set of rules that specifies:
- How a web server passes information to a program
- How that program passes information back to the web server
- Yes, it's another protocol…
- CGI does not specify:
- A particular language
- You can use Fortran, the shell, C, Java, Perl, Python…
- How the web server figures out what program to run
- Each web server has its own rules
- We'll (briefly) talk about Apache's
- When a CGI is run, the web server sends it information:
- Through environment variables (which are strings)
| Name | Purpose | Example |
|---|
REQUEST_METHOD | What kind of HTTP request is being handled | GET or POST |
SCRIPT_NAME | The path to the script that's executing | /cgi-bin/post_photo.py |
QUERY_STRING | The query parameters following "?" in the URL | name=mydog.jpg&expires=never |
CONTENT_TYPE | The type of any extra data being sent with the request | img/jpeg |
CONTENT_LENGTH | How much extra data is being sent with the request (in bytes) | 17290 |
Table 23.1: CGI Environment Variables
- On standard input
- If there is extra data, the web server will send
CONTENT_LENGTH bytes to the CGI program on standard input
- The CGI program sends data back to the web server via standard output
- In most cases, the web server forwards this directly to the client
- Which means that the CGI program is responsible for creating headers
- Comment on this slide
MIME Types
- Clients and servers need a way to specify data types to each other
- Remember, bytes are just bytes
- Multipurpose Internet Mail Extensions (MIME) standard specifies how to do this
- Organizes data types into families, and provides a two-part name for each type
| Family | Specific Type | Describes |
|---|
| Text | text/html | Web pages |
| Image | image/jpeg | JPEG-format image |
| Audio | audio/x-mp3 | MP3 audio file |
Video | video/quicktime | Apple Quicktime video format |
| Application-specific data | application/pdf | Adobe PDF document |
Table 23.2: Common MIME Types
- Comment on this slide
Hello, CGI
- Simplest possible CGI pays no attention to query parameters or extra data
- Just prints HTML to standard output, to be relayed to the client
- Also prints the
Content-type header to tell the client to expect HTML- And a blank line to separate the headers from the data
#!/python/python.exe
# Headers and an extra blank line
print 'Content-type: text/html\n'
# Body
print '<html><body><p>Hello, CGI!</p></body></html>'
- But the whole point of CGI is to generate content dynamically
- Show a list of environment variables
#!/python/python.exe
import os, cgi
# Headers and an extra blank line
print 'Content-type: text/html\n'
# Body
print '<html><body>'
keys = os.environ.keys()
keys.sort()
for k in keys:
print '<p>%s: %s</p>' % (cgi.escape(k), cgi.escape(os.environ[k]))
print '</body></html>'
- You'll use this frequently when debugging…
- Comment on this slide
Creating Forms
- HTML forms allow users to enter text, choose items from lists, etc.
- Not nearly as sophisticated as desktop interfaces
- Although programmers are doing more every day (particularly using Javascript)
- Create a form using a
<form/> elementaction is the URL to send data tomethod specifies the type of HTTP request to send (e.g., "POST")
- Inside the form, can have:
<select/> elements to let users choose values from a list- List items specified using
<option/> elements
<input/> elements for other kind of data- If
type is "text", get a one-line text entry box - If
type is "checkbox", get an on/off checkbox "submit" and "reset" create buttons to submit the form, or re-set the data to initial values
<html>
<body>
<form action="http://www.third-bit.com/cgi-bin/simple_form.py" method="POST">
<p>Sequence: <input name="sequence" type="text" value="GATTACA"/>
Search type:
<select name="search_type">
<option>Exact match</option>
<option selected="selected">Similarity match</option>
<option>Sub-match</option>
</select>
</p>
<p>Programs:
<input checked="checked" name="program" type="checkbox" value="FROG-11">FROG (version 1.1)</input>
<input name="program" type="checkbox" value="FROG-beta">FROG (2.0 beta)</input>
<input checked="checked" name="program" type="checkbox" value="Bayes-Hart">Bayes-Hart</input>
</p>
<p>
<input type="submit"/>
<input type="reset"/>
</p>
</form>
</body>
</html>
![[A Simple Form]](img/server/simple_form.png)
Figure 23.1: A Simple Form |
- Notice that each
<input/> element has a name attribute- These become the names of the parameters that the client sends to the server
- The input elements' values are the parameters' values
- Submitting the form shown above with default values produces:
os.environ['REQUEST_METHOD']: "POST"os.environ['SCRIPT_NAME']: "/cgi-bin/simple_form.py"os.environ['CONTENT_TYPE']: "application/x-www-form-urlencoded"os.environ['REQUEST_LENGTH']: "80"- Standard input:
sequence=GATTACA&search_type=Similarity+match&program=FROG-11&program=Bayes-Hart
- Comment on this slide
Handling Forms
- We could handle form data directly
- Read and parse environment variables
- Read extra data from standard input
- The mechanics are the same each time, so use Python's
cgi module instead- Defines a dictionary-like object called
FieldStorage
- Keys are parameter names
- Values are either strings (if there's a single value assocatied with the parameter) or lists (if there are many)
- When one is created, it reads and stores information contained in the URL and environment
- Program can then read extra data from
sys.stdin
- Example: show the parameters send to a script
#!/usr/bin/env python
import cgi
print 'Content-type: text/html\n'
print '<html><body>'
form = cgi.FieldStorage()
for key in form.keys():
value = form.getvalue(key)
if isinstance(value, list):
value = '[' + ', '.join(value) + ']'
print '<p>%s: %s</p>' % (cgi.escape(key), cgi.escape(value))
print '</body></html>'
| URL | Value of a | Value of b |
|---|
http://www.third-bit.com/swc/show_params.py?a=0 | "0" | None |
http://www.third-bit.com/swc/show_params.py?a=0&b=hello | "0" | "hello" |
http://www.third-bit.com/swc/show_params.py?a=0&b=hello&a=22 | [0, 22] | "hello" |
- Comment on this slide
Development Tips
- When developing scripts, it helps to add
import cgitb; cgitb.enable() at the topcgitb is the CGI traceback module- When enabled, it will create a web page showing a stack trace when something goes wrong in your script
- Testing whether a
FieldStorage value is a string or a list is tedious- In almost all cases, you'll know whether to expect one value or many
- Use
FieldStorage.getfirst(name) to get the unique value- Returns the first, if there are many
FieldStorage.getlist(name) always returns a list of values- Empty list if there's no data associated with
name - If there's only one value, get a single-item list
- Comment on this slide
Maintaining State
- Often want to change the data a server is managing, as well as read it
- Update a description of an experiment
- Change your preferred email address
- Add items to a shopping cart
- Your CGI programs can do this any way they want
- You want it to be simple, fast, and secure…
- …but these are not always compatible
- The industrial-strength solution is to use a relational database
- A three-tier architecture (browser, web server, database)
![[Three Tier Architecture]](img/server/three_tier_architecture.png)
Figure 23.2: Three Tier Architecture |
- CGI program stuffs parameters from HTTP requests into SQL queries
- Runs the queries
- Translates results into HTML to send back to the client
- Simple programs can often get away with using files
- The CGI program re-reads the file each time it processes a request
- And re-writes it if there have been any updates
- Example: append messages to a web page
- Old messages are saved in a file, one per line
Hi, is anyone reading this site?
I was wondering the same thing.
I wasn't sure if we were supposed to post here.
Good point. Is there way to delete messages?
- Script checks the incoming parameters to decide what to do
- If
newmessage is there, append it, and display results - If
newmessage isn't there, someone's visiting the page, rather than submitting the form #!/usr/bin/env python
import sys, cgi
import cgitb; cgitb.enable()
# Get existing messages.
infile = open('messages.txt', 'r')
lines = [x.rstrip() for x in infile.readlines()]
infile.close()
# Add more data?
form = cgi.FieldStorage()
if form.has_key('newmessage'):
lines.append(form.getfirst('newmessage'))
outfile = open('messages.txt', 'w')
for line in lines:
print >> outfile, line
outfile.close()
# Display.
print 'Content-Type: text/html\n'
print '<html><body>'
for line in lines:
print '<p>' + line + '</p>'
print '''
<form action="http://www.third-bit.com/swc/message_form.py" method="post">
<p>Your message:
<input name="newmessage" type="text"/>
</p>
<p>
<input type="submit"/>
<input type="reset"/>
</p>
</form>
'''
print '</body></html>'
- Note: there is no static HTML page
- What the user sees is always generated by a program
- Note also: a lot of this program is devoted to copying values into an HTML template
- There are lots of good systems out there, in many languages, for doing this
- Comment on this slide
What About Concurrency?
- What happens if two users try to save messages at the same time?
- First instance of
message_form.py opens messages.txt, reads lines, closes file - Second instance opens
messages.txt, reads the same lines, closes file - First instance re-opens file, writes out original data plus one new line
- Second instance re-opens file, writes out original plus a different new line
- First instance's message has been lost!
- A race condition
- Solution is to lock the file
- As the name implies, gives one process exclusive rights to the file
- After the first process acquires the lock, any other process that tries to read or write the file is suspended until the first releases it
- Mechanics are different on different operating systems
- But the Python Cookbook includes a generic file locking function that works on both Unix and Windows
- Modified solution:
- Open file
- Acquire lock
- If someone else is using the file, blocks until file is available
- Read existing messages
- If there's a new message:
- Seek to the beginning of the file
- Overwrite data
- And oh yeah, release the lock
- Otherwise, everyone else will wait forever
#!/usr/bin/env python
import sys, cgi, fcntl
import cgitb; cgitb.enable()
# Get existing messages.
msgfile = open('messages.txt', 'r+')
fcntl.flock(msgfile.fileno(), fcntl.LOCK_EX)
lines = [x.rstrip() for x in msgfile.readlines()]
# Add more data?
form = cgi.FieldStorage()
if form.has_key('newmessage'):
lines.append(form.getfirst('newmessage'))
msgfile.seek(0)
for line in lines:
print >> msgfile, line
# Unlock and close.
fcntl.flock(msgfile.fileno(), fcntl.LOCK_UN)
msgfile.close()
# Display.
print 'Content-Type: text/html\n'
print '<html><body>'
for line in lines:
print '<p>' + line + '</p>'
print '''
<form action="http://www.third-bit.com/swc/locking_form.py" method="post">
<p>Your message:
<input name="newmessage" type="text"/>
</p>
<p>
<input type="submit"/>
<input type="reset"/>
</p>
</form>
'''
print '</body></html>'
- Comment on this slide
Who Are You Again?
- What about maintaining state on the client side?
- Need to know which shopping cart to display for a particular user
- But HTTP is a stateless protocol
- If a client makes a second (or third, or fourth…) request, server has no reliable way of connecting it to the first one
- Can guess based on host, elapsed time, etc.
- But it's just a guess
- Solution is for the server to create a cookie
- A string that is sent to the client in an HTTP response header
- Client saves it (either in memory or on disk)
- The next time the client sends a request to the site, it sends the cookie back to the server
- Like giving someone a claim check for their luggage
- To create a cookie:
- Create an instance of
Cookie.SimpleCookie
- Do not use
SmartCookie: it is potentially insecure
- Add values to the cookie (it acts like a dictionary)
- Typically add things like the user's ID, the time of the visit, etc.
- Convert it to a string (e.g., by printing it) to create the required HTTP header
- To use a cookie:
- Create a cookie
- Get the value associated with the environment variable
"HTTP_COOKIE" - Pass that value to the cookie's
load method
- Example: count the number of times a user has visited a web site
- If there's no cookie, create one with a count of 1
- Otherwise, increment the count
- Create a new cookie to send back to the user
- Display the count
#!/usr/bin/env python
import sys, os, Cookie
import cgitb; cgitb.enable()
# Get old count.
count = 0
if os.environ.has_key('HTTP_COOKIE'):
cookie = Cookie.SimpleCookie()
cookie.load(os.environ['HTTP_COOKIE'])
if cookie.has_key('count'):
count = int(cookie['count'].value)
# Create new count.
count += 1
cookie = Cookie.SimpleCookie()
cookie['count'] = count
# Display.
print 'Content-Type: text/html'
print cookie
print
print '<html><body>'
print '<p>Visits: %d</p>' % count
print '</body></html>'
- Can control how long a cookie is valid by setting an expiry value
- Either the number of milliseconds
- Or the time it should expire (in UTC)
- Use
time.asctime(time.gmtime())
- Warning: do not put sensitive information in cookies
- Browsers store them in files on disk
- Villains can watch network traffic, and steal data
- Cookies should instead be random values that act as keys into server-side information
- Talk about this more in the next lecture
- Comment on this slide
Summmary
- CGI is example of event-driven programming
- The framework invokes your code at specific times, and passes it specific information
- What happens the rest of the time isn't your concern
- At least, until something goes wrong, and you have to debug it
- Simple CGI programs can accomplish a lot
- The entire first generation of web applications were built this way
- But please, attend the next lecture on security before making anything publicly available
- Comment on this slide
Security
Evil Exists
- This may be the most important lecture in the course
- Assuming you care about your data, your bank account, your reputation…
- …and those of others
- Computer security is a collective responsibility
- A system is only as strong as its weakest component
- If you are creating CGI scripts, or sending data over the web, you are putting others at risk as well as yourself
- Impossible to cover anything more than the basics in this lecture
- Comment on this slide
What Are We Trying to Do?
- Security systems try to achieve many (often contradictory) goals
- Let everyone who should be able to do something do it easily…
- …while blocking people who shouldn't be able to…
- …and gathering information about their attempts
- All three parts are important
- Most people are trustworthy most of the time
- Preventing legitimate users from doing things annoys them
- If people are sufficiently annoyed, they'll turn security off, or find ways around it
- Example: best way to ensure villains can't access data is to erase it…
- …which encourages people to write it down on scraps of paper
- Have to design the system for the villainous minority
- Any system that relies on trust will eventually be abused
- Be honest: do you always put a quarter in the jar when you get a cup of coffee in the staff room?
- Keeping track of how villains are trying to break in is (almost) as important as preventing them
- A good way to find holes in your system
- Often important to build an audit trail in order to take legal or disciplinary action
- Comment on this slide
Technology Alone Cannot Solve the Problem!
- This is the only exclamation mark in a topic title in the whole course
- Most successful attacks use social engineering, rather than technology
- Call up your bank, and see if you can get your credit card balance without your PIN
- Helps if you sound like a grandmother who is close to tears because her poodle has just been hit by a car
- Second most successful way to attack a system is to get a job with the company running it
- Burn an extra copy of credit card data while backing up the server
- Take notes of all the “to be fixed later” points that come up during the security audit of the web site
- Many companies choose not to press charges, rather than deal with bad publicity after a security failure
- Example: AOL employee sold 92 million customer records to a spammer for $100,000
- Combination of inside job, and stealing the credentials of more privileged user
- Exploiting carelessness is the third
- Many people don't bother to change the default password on their wireless router
- Many more choose easily-guessed passwords
- “Easy” if you have the right tools, that is
- But remember: once one villain builds a tool, they can all use it
- In fact, technology often makes systems less secure
- Example: facial recognition software that works correctly 99% of the time
- So one person in a hundred is mistakenly identified as a potential terrorist
- 300,000 passengers a day in a busy airport
- That's one false alarm every 30 seconds
- How much attention do you think the guards will pay to the system after a week on the job?
- There is still a lot you can do, both as a user and as a programmer
- This lecture concentrates on the latter
- Comment on this slide
How to Think About Security
- Security systems are responsible for:
- Every exploit attacks one or more of these
- Convince the system you are someone else
- A regular user (if you're trying to buy stuff with someone else's credit card)
- An administrator (or someone else with special privileges)
- Convince it that you're allowed to do something you're not
- E.g., give yourself administrative privileges
- Circumvent its enforcement of the rules
- A browser bug lets Javascript in a page send copies of your cookies to a villain's site
- When analyzing security, look for ways to compromise the three A's
- Use defense in depth
- If your first security mechanism fails, is there another one behind it just in case?
- Remember: the real purpose of a bicycle lock is to convince thieves to go and steal something else
- First step is always risk assessment
- What is most important to you?
- I.e., what will cause the most pain if it's broken, or broken into?
- Comment on this slide
Rule 1: Don't Trust Your Input
- Anyone who knows the URL of your web application can send it data
- I.e., anyone who has a web browser, knows how to “View Source”, and can read HTML forms
- Therefore no guarantee that the HTTP request you receive was generated from your form
- The input provided for a selection list may not be one of the values you offered
- The input for a text field may be longer than the maximum you specified
- Some parameters may be missing from
QUERY_STRING, while unexpected ones may be present QUERY_STRING may not even be formatted according to the HTTP specification
- You must validate all of the input before using it
- Check that all the parameters you expect, and only those, are present
- Check that values meet constraints
- You usually have to do this anyway
- E.g., check that the string provided for “Atomic Number” actually represents an integer
- Or that a villain hasn't replaced
"price=399.99" with "price=3.99" in the query string
- Make sure that special characters have been escaped
- But watch out for double-escaping
- And don't reinvent the wheel: every language's library has routines to do URL escaping, HTML escaping, etc.
- In practice, if input can't be validated, best to reject the request entirely
- One exception: if your application is evolving, you might want to see if this input was meant for an old version
- Comment on this slide
Paths and I/O
- Always check filenames supplied by user
- Just a special case of the preceding rule, but an important one
- Example: controlling access to documents
- All your documents live below
/web/docs - When you get a URL, you check that
url[:9] == '/web/docs' to prevent people from accessing /home/dknuth/grades.xsl - So someone sends you
/web/docs/../../home/dknuth/grades.xsl - Solution: normalize paths before using them, and then check whether the result lies in or below the directories you want to open up
- Analysis: circumventing access control
- But also using a weakness in authorization
- The web server (which is running the CGI on the client's behalf) shouldn't have permission to read anything it doesn't absolutely need to
- Example: temporary files
- While handling requests, a CGI creates a temporary file
/tmp/1728397.cgidata (where 1728397 is a random number) - A villain with an account on the system writes a Python script that repeatedly looks in
/tmp for *.cgidata files- When it finds one, it makes a copy for the villain to inspect later
- Or overwrites its contents with fabricated data
- Analysis?
- Comment on this slide
Rule 2: Never Run User Commands
- Suppose you want to build a search engine
- Can't be bothered to look for strings in files yourself, so you run
grep in a sub-process #!/usr/bin/env python
import os, cgi
form = cgi.FieldStorage()
term = form['term']
cmd = 'grep %s data.txt' % term
instream = os.popen(cmd, 'r')
results = instream.readlines()
instream.close()
print 'Content-type: text/html\n'
print '<html><body>'
for line in results:
print '<p>', cgi.escape(line), '</p>'
print '</body></html>'
- What happens if someone sends a URL with
"term=dummy /dev/null;rm -rf *"?
cmd becomes grep term=dummy /dev/null; rm -rf * data.txtgrep produces no output (since /dev/null is always empty)- Then the web server deletes everything it can, recursively
- Oops
- Adding quotes around terms and escaping special characters hide the problem rather than eliminating it
- Right solution: don't rely on external programs
- More specifically, if there's a layer of interpretation between you and what you're running, then there's one more weak spot for a villain to attack
- Note that the same thing can happen even if you don't rely on outside programs
- If
expr is a string, eval(expr) interprets it as if it had been typed in then and there - Sometimes used to let people specify complex search criteria
- HTTP request contains
"criteria=record.left+record.right<0" - CGI plugs it into a standard loop
#!/usr/bin/env python
import os, cgi
form = cgi.FieldStorage()
criteria = form['criteria']
results = []
for record in readRecords():
keep = eval(criteria)
if keep:
results.append(record.title)
print 'Content-type: text/html\n'
print '<html><body>'
for title in results:
print '<p>', cgi.escape(title), '</p>'
print '</body></html>'
- But it has the same problem as using the shell
- Exercise for the reader: construct an HTTP query string that will do something evil
- SQL injection is a variation on the theme
- Your program expects a user name
- Constructs SQL to query the database like this:
query = "SELECT * FROM Data WHERE Username='%s'" % username
cursor.execute(query)
results = cursor.fetchall();
- But someone enters this as their user name:
fred'; UPDATE Employees SET Salary = 1000000 WHERE username = 'me
- The resulting SQL is legal (all the quotes match), but probably isn't what the programmer wanted
- Cross-site scripting is the same problem in another guise
- If a bulletin board or discussion forum lets you post HTML, you can include Javascript in it
- When someone views what you wrote, their browser runs that Javascript
- It can harvest information from your machine, and post it to the attacker
- Comment on this slide
Cryptography 101
- Encryption is the process of obscuring information so that it can't be read without special knowledge
- An algorithm for encrypting and decrypting is called a cipher
- Original and encrypted messages are called plaintext and ciphertext respectively
- All classical (pre-1970s) ciphers are symmetric
- Same key is used for both encryption and decryption
- Which means that the key can only be shared among trusted parties
- Asymmetric ciphers have two keys
- Each undoes the other's effects
- (Practically) impossible to determine one given the other
- Asymmetric systems are often called public-key systems
- Publish one key, keep the other secret
- The published key is called the public key, the unpublished one is called the private key
- If you know how the underlying modulus is built you can relatively easily derive one key from the other.
- Anyone who wants to send a message to you encrypts it using the public key
- You're the only one who can decrypt it
- Asymmetric systems can be used sign a message.
- You send a text and append cypher of it encrypted with your private key.
- Everybody can decript the cypher text with the public key and if it must be identical with the clear text.
- Then it is guaranteed that the message was encrypted with the private key (unless the system is cracked).
- It is also guaranteed that the clear text was not changed, e.g. to not buy 10 cars instead of 1.
- Which is better then for regular signatures.
![[Secure Communication with Asymmetric Keys]](img/security/public_keys.png)
Figure 24.1: Secure Communication with Asymmetric Keys |
- Note: symmetric encryption is typically many times faster than asymmetric encryption
- Usual scheme these days is to use asymmetric encryption (slow) to exchange a one-time symmetric key
- Then use the symmetric key (fast) for the rest of the transaction
- Comment on this slide
How Asymmetric Cryptography Works?
- Best know algorithm is RSA
- Invented by Ron Rivest, Adi Shamir and Len Adleman in 1977 at MIT.
- Where might the name RSA come from?
- Bases on Fermat's little theorem
- For every a it is a ** (p-1) % p == 1 if p is a prime number.
- Now choose two large prime numbers p and q.
- Compute n = p*q.
- How n is factorized is actually the important secret.
- Choose an integer e != 0 which is not disible by (p-1) and by (q-1).
- Compute an integer d such that d*e % ((p-1)(q-1)) == 1.
- Therefore (a ** e) ** d == (a ** d) ** e == a (modulo n).
- In other words, some text encrypted with e and than encrypted again with d returns to the original.
- Same way, first encrypted with d and then with e gives the original.
- n and e are published.
- Factorizing n allows you to generate d.
- If you can do this you get immediately an A in this course.
- And most likely a Turing Award, a Fields Medal or a Nobel Prize.
- And probably some trouble from secret services.
- Most people consider cracking RSA as complicated as factorization.
- But there is no proof that you cannot crack RSA in another way.
- Which computers can factorize numbers in polynomial time?
- Comment on this slide
Securing HTTP
- HTTP sends data as cleartext
- As is too often the case, security was ignored in HTTP's original design
- Netscape later developed HTTPS (Secure HTTP) to protect confidential information
- Uses a different port (443 instead of 80) protocol (
https in URL instead of http) - Encrypts data between the browser and the web server
- Does not guarantee secure storage on the server
- Far too many web sites store sensitive information in databases as cleartext
- Gives villains another point of attack
- Another flaw in HTTP is its built-in password handling (called basic authentication)
- Sends the user name and password as cleartext
- Actually, encoded using base-64…
- …but that's not a cipher
- Never use HTTP basic authentication
- Note: having users submit ID and password via a form is also insecure
- Form data isn't encrypted
- Alternative:
- Have user provide ID and password over secure connection
- Use a random number as a cookie
- Do not just use a sequence of integer session IDs: too easy for attackers to fabricate
- Give that to the client to track the session
- When it comes back, use it as a key into a dictionary of active sessions
- Still not perfect
- If villains can snoop on network traffic, they can hijack sessions
- Insert a copy of your cookie into their message
- Also vulnerable to replay attacks
- Copy the cookie (or an entire message) and re-send it later
- Useful if the message means “open the vault door”
- Note: none of this helps if there is spyware on the client machine
- These days, this is much more likely than someone sniffing network traffic
- Keep your anti-virus protection and spyware monitors up to date, and run them regularly
- What else is there for your machine to do at 3:00 a.m.?
- Comment on this slide
How do you login remotely?
- Data are send as cleartext in
"rlogin" and "telnet".
- Even account names and passwords are sent in clear text.
- Everybody on the same sub-net can easily track them.
- Use
"ssh" instead whenever possible.
- Most machines these have only ssh, anyway.
- For the same reasons use
"scp" instead of "ftp". - Comment on this slide
Do and Don't
- Always consult an expert when designing a system that relies on encryption
- Easy to use a perfectly good cipher in a way that either weakens the crypto, or provides no real security
- Example: 802.11 wireless networking took a cipher that is secure when used correctly…
- …and created a system which gives away both your data and your private key to eavesdroppers
- Then they burned it into ROM on millions of devices
- Never rely on keeping your algorithm secret
- Security must depend only on secrecy of keys
- Never design your own ciphers
- Unless you have a Ph.D. in cryptography, in which case you should be teaching this section of the course
- Use the best well-known ciphers—they have been thoroughly tested, and are more secure than anything you could invent
- 3DES or AES for symmetric encryption
- RSA, DSA, or EC-DSA for public-key
- Comment on this slide
Summary
- This lecture has only scratched the surface
- Have discussed only a tiny fraction of the issues that programmers face
- Haven't talked about user-level issues at all
- If you are going to do anything that's even a little bit complicated, please talk to a professional
- The guy down the hall who once read an O'Reilly book on Perl doesn't count
- Comment on this slide
Teamware
Introduction
- No programmer is an island
- You may be the only person writing code today, but you:
- Probably depend on tools and libraries written by other people
- May have inherited the project from someone else
- Ought to be thinking about who's going to take over once you're gone
- As open source spreads, being able to work in teams becomes even more important
- This lecture looks at some of the things teams have to do to stay productive
- All of which will make you more productive as an individual, too
- Introduce the ideas by showing you how to use a web-based project coordination tool called Trac
- Open source
- Reliable
- Easy to set up and install
- Modular, so that you can add new features (relatively) easily)
- Comment on this slide
Overview
- Trac is a Python CGI program that provides:
- A repository browser, so that users can inspect the contents and history of a version control repository
- An issue tracker for managing a shared to-do list
- Mailing lists, to help groups stay in touch
- A wiki for creating web pages and making links to other items in Trac
- A roadmap, for keeping track of upcoming milestones, and what's to be done for them
- A dashboard, which graphs statistics about project progress and status
- A timeline showing recent project activity
- A weblog, which reports timeline events to interested parties
- These are all integrated
- To-do items can refer to web pages, which can refer to particular revisions of files, etc.
- A single Trac installation can manage any number of projects
- Each project's files are stored in a separate Subversion repository
- Everything else stored in a single shared database
- SQLite for development work and small sites
- PostgreSQL for large sites
![[Trac Architecture]](img/teamware/trac_architecture.png)
Figure 25.1: Trac Architecture |
- Relies on host to handle authentication
- Every user must have an account on the underlying Linux system
- One less password for them to remember
- Home page looks like pretty much every other web portal's home page
![[Typical Trac Home Page]](img/teamware/trac_home_page.png)
Figure 25.2: Typical Trac Home Page |
- Examine each of the components in turn below
- Comment on this slide
Repository Browser and Timeline
- The repository browser is a web-based interface to Subversion
- Read-only: cannot update local files, or commit changes
- Implementing this would require giving web pages the ability to mess with your hard drive
- If you don't know why this is a bad thing, please re-read the lecture on security
- Read-only is more useful than you'd think
- Browse directories and files to see what they contain
![[Browsing Directories and Files]](img/teamware/repo_browser_dirs_files.png)
Figure 25.3: Browsing Directories and Files |
- View old versions of files
![[Viewing an Old Version of a File]](img/teamware/repo_browser_old_file_version.png)
Figure 25.4: Viewing an Old Version of a File |
- See changes made to files over time
![[Viewing File Changes]](img/teamware/repo_browser_view_file_diff.png)
Figure 25.5: Viewing File Changes |
- Following a changeset link takes you to the timeline view
- Shows all recent project events
![[The Timeline]](img/teamware/timeline_all.png)
Figure 25.6: The Timeline |
- Filter control lets you select what types of events you want to see, from how long ago
![[Filtering Events]](img/teamware/timeline_filter.png)
Figure 25.7: Filtering Events |
- Very useful if you've been busy with other things for a while, and need to find out what's happened in your absence
- Comment on this slide
Issue Tracker
- An issue tracker (sometimes called a bug tracker is a tool for managing a shared to-do list
- What needs to be done?
- Who's responsible for each task?
- What state is each task in?
- Essential tool for managing long-lived or multi-person projects
- But only as useful as the information in it
- Trac's issue tracker is simpler than most, in order to make entering data very easy
- Most large projects use more complicated tools, such as Bugzilla
- To create a ticket:
- Follow the “New Ticket” link
![[Creating a New Ticket]](img/teamware/issue_tracker_new_ticket.png)
Figure 25.8: Creating a New Ticket |
- Fill in the fields
- Who you are
- A one-line title for the ticket
- What kind of ticket it is
- How important it is
- When it needs to be resolved
- A description of the issue
- And submit
- To see what tickets are already in the system:
- Follow either the “View My Tickets” or “View Project Tickets” link
- The first shows your tickets across all projects
- The second shows all tickets for the current project
- Tickets can be sorted and filtered in various ways
- Click on a ticket to view or update it
- All the information entered by the original author
- A flag showing whether it is open (i.e., still needs work) or closed (completed)
- Any comments other people have added to the ticket
- Every time you change a ticket, you should explain why
- Like commenting on changes you submit to version control
- Note: after updating a ticket, go back to the timeline view
- Time of change, ticket ID, and comment are displayed
- The display is a link back to the ticket
- Information is more useful when it's connected
- Comment on this slide
How to Write Tickets
- A badly-written ticket is better than no ticket at all, but not by much
- The summary should be short and informative
- Aim is to help people who are looking at 100 summaries find the one(s) they care about
- “Bug in seq comp” is bad
- “Seq. comparison returns wrong probabilities for bivalves” is much better
- Usually have three kinds of tickets:
- Bugs
- Feature requests (i.e., someone wants the software to do something it doesn't)
- Questions
- I.e., “Does anyone know how to parse external entities with Expat?”
- Or, “How many old versions of Windows are we going to support?”
- May seem odd to file these as issues, but they are for the person who's asking
- And it's more efficient than posting them to the mailing list
- Whole team has to agree on what importance values mean
- Low: cosmetic UI problem, minor workflow annoyance with obvious workaround, etc.
- High: program crashes, returns wrong result, inadvertently launches nuclear assault on Iceland, etc.
- Medium: anything in between
- In the weeks and days leading up to a release, perform regular triage
- Go through the open issues, and re-set their priorities so that everyone knows what they ought to be working on
- Description: provide all the information someone needs to know to address the issue
- For a bug, need:
- Software configuration, including package versions, operating system, etc.
- Sequence of steps or unit test case that triggers the fault
- Configuration files, input data, screenshots of the fault, etc.
- Similar level of detail for feature requests and questions
- Remember, it may be weeks or months before the issue is addressed
- Comment on this slide
Mailing Lists
- Trac assumes that everyone has a mail reader that they're happy with
- Doesn't try to compete
- No way to compose or send mail in Trac
- Users don't have mailboxes on the system
- Instead, Trac creates one mailing list for each project
- Everyone who's a member of a project is automatically on that list
- Only project members can send mail to it
- Every user must whitelist one or more email addresses with Trac
- The opposite of blacklisting: specifies who's allowed in, instead of who's to be kept out
![[Whitelisting an Email Address]](img/teamware/whitelist_email_address.png)
Figure 25.9: Whitelisting an Email Address |
- Trac will accept mail from any whitelisted address
- User must specify which of those addresses to forward project mail to
- This way, people can send from any external mail account, but all project mail arrives in one place
- Timeline shows old mail messages as well as other events
- Not threaded by topic: unnecessary complexity for small projects
- Comment on this slide
Wiki
- A wiki is a simple web-based system for creating web pages
- Term comes from the Hawaiian “wiki wiki”, meaning “faster faster”
- If you have logged in, and have permissions to edit the wiki, then every page has an “Edit” button
- Clicking this displays the page source in a text edit box
![[Editing a Wiki Page]](img/teamware/wiki_edit.png)
Figure 25.10: Editing a Wiki Page |
- Syntax is much simpler than standard HTML
- Blank lines separate paragraphs
- Any word in CamelCase is automatically interpreted as a link to a page with that name
= Title = creates a level-1 heading, == Subtitle == creates a level-2 heading, etc.- Series of indented lines beginning with
* is displayed as a list - Anything ending in
.png, .jpg, or .gif is automatically displayed as an image
- Doesn't let you do everything…
- …e.g., you can't specify alignment of table items…
- …but it makes it very easy to create meeting minutes, notes for first-time users, etc.
- Once you've made changes, you can preview the page, or commit the changes
- Pages are saved in the database, rather than in Subversion
- Translated from wiki syntax into HTML when the page is viewed
- Note: no conflict resolution mechanism
- If two people try to edit a wiki page at the same time, the second one to commit will be denied
- So put large and/or frequently-updated documents under version control instead
- Wiki syntax offers shortcuts for referring to other things in Trac
#22 links to ticket 22@41 links to email message 41[94] links to revision 94 in the version control repositorysource:path/to/filename.txt links to a particular file in the Subversion repositorysource:path/to/filename.txt#94 links to a particular version of the file
- You can use this same syntax in:
- Tickets (to refer to changesets, file versions, email messages, etc.)
- Email messages (as long as your mailer sends messages as plain text, rather than HTML)
- Subversion commit comments
- Just about everywhere else
- Remember, information is more useful when it's connected
- Comment on this slide
Roadmap and Milestones
- A milestone is a time by which something is supposed to be done
- A roadmap is just a collection of milestones
- Usually give milestone symbolic names, like “version 2 beta release”
- That way, when the date changes, you don't have a milestone called “April 1” whose due date is May 15
- Trac requires you to associate every issue with a milestone
- Default is the empty milestone, which means “sometime”
- Can sort and view tickets by milestone
- Moving low-priority tickets from upcoming milestone to later ones is part of pre-release triage process
- Much (much) simpler than most project management tools
- No Gantt charts, no dependency graphs, etc.
- They're overkill for small projects
- Particularly research projects, which frequently change direction
- Comment on this slide
Dashboard
- The dashboard is a graphical complement to the timeline view
- Plot changes to code, tickets, etc. over time
![[Dashboard Display]](img/teamware/dashboard.png)
Figure 25.11: Dashboard Display |
- Allow users to compare activity in various projects
- Has anyone been working on this project in the last few weeks?
- Has this project's test coverage been going up or down?
- In an educational setting, allow people to compare their projects against a class average
- Which can be very motivational
- Comment on this slide
Blogging
- Weblogs (or blogs) started off as on-line journals
- Author updates a file with new journal entries
- File written in an XML format called RSS
- Blog-reading software downloads file at irregular intervals
- If anything has changed since the last time, display the titles of new articles
![[How Blogs Work]](img/teamware/how_blogs_work.png)
Figure 25.12: How Blogs Work |
- A publish-subscribe system
- If you no longer want to read a blog, stop polling for updates
- Helps reduce information overload
- Every Trac project creates two blogs
- One includes only articles by people (announcements of new releases, commentary on the world at large, etc.)
- The other also includes all timeline events
- Project members and supervisors can subscribe to be notified of changes to files and tickets, etc.
![[A Trac Timeline Blog]](img/teamware/trac_timeline_blog.png)
Figure 25.13: A Trac Timeline Blog |
- The “Compose Blog Entry” page allows project members to create entries for the announcements blog
- Note: only people logged in to that Trac installation can comment on entries
- Occasionally annoying, but helps reduce comment spam
- Comment on this slide
Administration
- Trac is administered from the command line, rather than through the web
- Simpler to write and maintain
- Makes it simpler for administrators to create a dozen accounts in a single step, etc.
- Two interfaces:
- A standalone command-line tool called
trac-admin - A Python library, so that administrators can write scripts to do whatever they want
- As mentioned earlier, Trac relies on underlying operating system to manage user accounts
- Authentication is (only) as secure as your password policy
- Trac uses (person, project, permission) triples for authorization
- E.g.,
(aturing, enigma, WIKI_WRITE) or (ghopper, demo, TICKET_READ) - Every operation requires one or more permissions
- If there isn't a triple giving you a permission, you don't have it
- Some standard roles are pre-defined (and more can be added)
- A role is a set of permissions
(jvn, funiac, ROLE_DEVELOPER) gives a user all the permissions associated with the developer role
- Comment on this slide
Summary
- And that's it
- At least, so far—Trac is under active development
- All you need to run a small or medium-sized project
- Very useful for geographically-distributed collaborations
- Just as useful when one person has to manage his or her own work over the course of months or years
- Harder to lose a database than a lab notebook…
- If Trac is too small for your needs, there are many larger alternatives
- SourceForge is by far the most famous
- Will host open source projects for free on their machines
- Comment on this slide
Exercises
Exercise 25.1:
Can you find out what bugs are currently being worked on?
What feature requests have been deferred? Which files were changed
to fix a problem? What fixes are currently being tested? How long it
took to fix/implement something?
Exercise 25.2:
What is the status of the overnight build? The overnight
regression tests? The issue database? The team's
discussions?
Extreme Programming
Code Early, Code Often
- Knowing how to write code doesn't tell you how to run a project
- Where do you start?
- What do you do next?
- Who's responsible for what?
- How do you tell when you're done?
- Answers to these questions define a development process
- Dozens have been formalized and described
- But in practice, almost none are followed to the letter
- Every real-world situation and team is different, so real process is always a variation on the theme
- I always add extra garlic to my chili…
- …and just about everything else, come to think of it
- This lecture looks at a process called Extreme Programming (XP)
- Best suited to small teams that are trying to cope with constantly-changing requirements
- Which is a good description of many scientific research teams
- Look at two alternatives in the next couple of lectures
- Remember: every team is different
- Most important thing is not which process you use, but that you're paying attention to process
-
[DeMarco & Lister 1999]
is a good introduction to teams and productivity
- Comment on this slide
Waterfalls and Spirals
- In order to understand XP, you have to understand what it's a reaction against
- Also have to understand the cost of change
- The later a change is made, the more expensive it is to implement
![[The Cost of Change]](img/xp/cost_of_change.png)
Figure 26.1: The Cost of Change |
- Therefore important to make decisions early, and stick to them
- The waterfall model, described by Royce in 1970, describes development flowing through several phases
- Requirements analysis
- Design
- Implementation
- Testing
- Maintenance
![[The Waterfall Model]](img/xp/waterfall_model.png)
Figure 26.2: The Waterfall Model |
- Problems:
- If different people responsible for different phases, then no one has to deal with the consequences of their mistakes
- Whoever is responsible for testing has to make up all the lost time from the previous phases
- Time lag: it can take a long time for changes in requirements to filter through to the finished product
- The spiral model described by Boehm in 1986 wraps this on itself
- Get a goal from the customer, design, implement, and test it, and repeat
- Note: Royce actually advocated doing this too, but most people have forgotten that
- Key ideas:
- The code teaches you about the problem
- Customers can only find out what they actually want by playing with a working system
- But Boehm still envisaged cycles lasting from six months to two years
- Comment on this slide
Enter the Extremists
- XP arose in the 1990s to cope with:
- Ever-changing requirements
- Internet time
- Six-month iterations were longer than the lifespan of the average dot-com
- Assumption: change is cheaper than it used to be
- Note: less data to support this than some XP advocates would have you believe…
- Basic idea: each iteration lasts only one to four weeks
- Implies that each version is only a small step forward from the previous one
- Benefit is constant course corrections from the customers
- Equivalently, can re-direct project every couple of weeks to deal with changes
- Problem: how to ensure that developers stay on track when the track is constantly changing direction?
- Comment on this slide
Core Practices
- Solution is a process built around twelve core practices, which are usually grouped into four areas
- Detailed feedback
- Test driven development: write tests before code in order to get instant feedback on correctness
- Planning game: described below
- Onsite customer: must be a customer representative on site to answer developers' questions at all times
- Subsequent revisions to the process acknowledged that this is sometimes not possible
- But frequent contact between developers and customers still essential
- Pair programming: also described below
- Continuous process
- Small releases, so that customers can quickly check out all of the changes and additions
- Continuous integration: everyone builds and run tests on a very tight cycle
- Preferably several times an hour while coding
- Continuous refactoring: everyone cleans up what they're about to work on before they start work on it
- And follows through into other modules if there are ripple effects
- Ensures that code is always clean enough for the next person to start work
- Shared understanding
- Simple design: speaks for itself
- In practice, design of a rapidly-evolving system may seem simple to those who've watched it evolve, but not to newcomers
- System metaphor: use simple analogies to provide a high-level overview of the system
- E.g., it's an assembly line, a blackboard, etc.
- Gives everyone a common language with which to discuss it
- Collective code ownership: everyone is allowed to change any part of the system
- No one is a bottleneck (i.e., you're never waiting on “the GUI guy”)
- Everyone has to (potentially) eat their own cooking
- Coding standard: essential to making collective ownership possible
- Programmer welfare
- Sustainable pace: no one works overtime
- Because in almost all cases, overtime actually puts the project behind
- Tired people make more mistakes, which then have to be fixed
- And fixing things downstream is more time-consuming than slowing down and getting them right in the first place
- Programmer turnover delays projects as well
- Lose all the implicit knowledge in the departing programmer's head
- Remaining team members have to spend time bringing newcomers up to speed
- Comment on this slide
The Planning Game
- At the start of each iteration, customers write user stories to define the functionality they want added or changed
- Each one is just a few sentences long
- Each one should fit on a 3×5 index card
- Describes functionality, not features
- I.e., what the system can do, not how it does it
- It's the developers' job to figure out what features to add to implement the desired functionality
- Set of user stories define the project requirements
- As project evolves, throw away or update old ones as well as adding new ones
- Developers then estimate how long each user story will take to implement in ideal weeks
- I.e., five perfect interruption-free days
- Customers and developers then decide which stories to include in this iteration, and which to leave for the next
- If something is too big to fit into an iteration, break it down into smaller pieces
- You're going to do this anyway when developing it…
- Comment on this slide
Pair Programming
- Pair programming is the practice of having two people work together at one computer
- One types while the other watches and comments
- Predates XP, but was popularized by it
- Seems like a waste of resources, but:
- Two pairs of eyes are more likely to spot errors as they're being made than one
- In fact, because of cognitive dissonance, the person watching is much more likely to spot errors than the one typing
- Fewer errors translates into less time spent debugging
- The earlier something is fixed, the cheaper the fix is
- And pair programming helps ensure wider understanding of the system
- More people able to work on more things
- Feels strange the first time you do it
- And you need to reorganize desk layout to prevent back and neck problems
- But unlike many claims made in software engineering, there's empirical evidence that it actually works
[Williams & Kessler 2003]
- Though comparisons are usually made against two programmers who are both testing their own work…
- …rather than against one programmer developing, and another testing independently
- Comment on this slide
Project Velocity
- Once the iteration is under way, the team tracks the project velocity
- How quickly are user stories actually being implemented?
- Use this to predict how long it's going to take to finish the agreed-upon work
- If it has taken 70 programmer-days to implement the first ten stories, it's safe to bet that it'll take 140 PD to implement the remaining twenty
- Hoping for a miraculous improvement in productivity is always a mistake
- What to do if the team isn't going to have everything done by the deadline?
- Asking them to work longer hours is counter-productive
- So extend the deadline…
- …or cut back the number of stories that are going to be implemented
- This is why you have a customer representative on site
- Resources, features, and schedule are not independent variables
- You can set two, but the third then sets itself
- Comment on this slide
The Down Side
- XP and related methodologies have been criticized almost as widely as they have been adopted
- For every bandwagon there's an equal and opposite anti-bandwagon
- Some of this is just taking pot-shots at easy targets, like:
- “I think maybe concentration is the enemy. Seriously. If you're working on something that is so complex that you actually need to concentrate, there's too much chance that it's too hard.” (Ron Jeffries)
- But some of the criticisms are well-founded
- If developers don't have considerable self-discipline, XP quickly degenerates into “cowboy coding”
- Busy, rather than productive
- Customers often need to know what they're going to get more than a few weeks in advance
- You can't add software to a space probe three days before launch
- Many programmers don't like to work in pairs
- “I trust my account to do my tax return on her own; why don't you trust me to code?”
- XP is intrinsically limited to small teams
- Which in turn means small projects
- Comment on this slide
Summary
- Many people find that their productivity goes up when they adopt XP
- But others find that their productivity goes up when they adopt diametrically opposed processes
- Like the one will see in the next lecture
- Option #1: most people program in the worst possible way, and anything is an improvement
- Option #2: the most important thing isn't which process you adopt, it's that you actually adopt one
- What matters is that you're paying attention to process
- And thinking about how to improve it
- “The unexamined life is not worth living.” (Socrates)
- Comment on this slide
The ICONIX Process
Do It Once, Do It Right
- Extreme Programming is one end of the spectrum
- If your requirements are always perpetually in flux, focus your process on adapting to/coping with change
- This lecture looks at the other end
- Well, one of the other ends—it's not like programmers would ever make anything as simple as a two-ended spectrum
- The ICONIX process was developed by Doug Rosenberg and colleagues
- See
[Rosenberg & Stephens 2005]
for the definitive guide
- A simple alternative to the Rational Unified Process (RUP) and other heavyweight paperwork-oriented processes
- Emphasizes up-front analysis and design
- Reduces total cost and time to delivery dramatically…
- …provided your requirements are relatively stable
- Probably not appropriate for small, exploratory projects
- But probably is if you're writing software that's going into a space probe
- Comment on this slide
The Unified Modeling Language
- Many disciplines rely on graphical notations with precisely-defined semantics for design and communication
- Architectural blueprints
- Circuit diagrams
- Feynman diagrams
- Thousands have been invented for programming, but only a handful have ever gained widespread acceptance
- Flowcharts
![[Flowcharts]](img/iconix/flowchart.png)
Figure 27.1: Flowcharts |
- More useful than advocates of structured programming in the 1970s believed
- See
[Scanlan 1989]
for evidence and discussion
- Entity-relationship diagrams
- Used to design database tables
![[Entity-Relationship Diagrams]](img/iconix/entity_relationship.png)
Figure 27.2: Entity-Relationship Diagrams |
- Class diagrams
![[Class Diagrams]](img/iconix/example_class_diagram.png)
Figure 27.3: Class Diagrams |
- Who inherits from whom?
- Which classes refer directly to which others
- The Unified Modeling Language (UML) combines several notations used in object-oriented analysis and design (OOAD)
- Its three principal authors all had their favorite ways of doing things
- Unlike most designs by committee, this one actually (mostly) makes sense
- UML 2.0 is very large
- The ICONIX process uses only five kinds of diagram
- Four from UML, plus one to bridge an important gap
- Each diagram corresponds to a step in the analysis and design process
- Comment on this slide
From Here to There
- What we have: a prototype of a user interface
-
[Johnson 2000]
will help you avoid horrible mistakes
- If you're trying to get a group talking,
[Powazek 2001]
is essential reading as well
- Draw it on paper (one screenshot per page)
- Use sticky notes and the like for menus, dialogs, etc.
- Test subjects are less likely to hold back criticism of something that looks thrown together
- Write some HTML pages by hand
- Fire up Visual Basic or another drag-and-drop GUI builder
- Make it very clear that this is just a prototype!
- Or your boss will want to ship it to the customer
- What we want: working code
- That does what it's supposed to
- That's been tested
- That we can maintain
![[ICONIX: Start and End Points]](img/iconix/iconix_question.png)
Figure 27.4: ICONIX: Start and End Points |
- Comment on this slide
Use Cases
- First step is to translate our fuzzy ideas about how the interface should be used into something more exact
- A use case is a short point-form story that describes how a user accomplishes a particular task
- Usually written point-form
- Example: Sally wants to run a mitosis simulator for each temperature in a range
- Sally selects “File… Set Simulation Parameters”
- System displays “Simulation Parameters” dialog with default values
- Sally changes “Temperature” from “Single” to “Range”
- System disables temperature entry control for single temperature
- System enables temperature entry controls for “Start”, “Increment”, and “End”
- Sally enters values and clicks “OK”
- If any values illegal or inconsistent, dialog beeps and displays error message explaining the problem
- Otherwise, simulator updates its values, and dialog disappears
- Sally selects “File… Run”
- System enables “Pause” and “Stop” buttons and begins running simulations
- Yes, it's a lot of detail
- But you're going to have to work it all out at some point
- And many studies have shown that writing it down and checking it before you start coding leads to a working program faster
- Also allows QA and technical writing to overlap their work with development
- Note: make sure they're involved in writing test cases
- If QA can't think of a way to test something, fix it
- Ditto if technical writer can't think of a way to explain it
- A use case diagram shows the relationships between
- The actors in the system (i.e., the users and major system components)
- The individual use cases
- Comment on this slide
Domain Model
- At this point, have a rough idea of the major elements of the domain model
- The set of abstractions the system will use to represent the real world
- Example: domain model for a traffic simulator may contain:
- Vehicles
- Vehicle types
- E.g., bicycle, motorcycle, small car, family car, light truck, heavy truck…
- Each vehicle is an instance of exactly one vehicle type
- Drivers
- Driver behaviors
- A set of statistical rules determining speed, likelihood of signalling, etc.
- Typically shared by many drivers
- Road segments
- Intersections
- Each road segments connects to exactly two intersections
- Each intersection connects to one or more road segments
- “One” in order to model dead end streets
- Graphical display and other GUI elements
- Configuration files
- Etc.
- Domain model diagram shows relationships between domain model elements
![[Domain Model Diagram]](img/iconix/iconix_domain_model.png)
Figure 27.5: Domain Model Diagram |
- Inheritance: one element is a specialization of another
- Association: one element refers to another (or two elements refer to each other)
- Usually annotate with multiplicity
- I.e., one-to-one, one-to-many, many-to-many
- If you don't get the domain model right, everything else will go wrong
- Note: writing use cases, and defining a domain model, is always iterative
- Discussion about use cases uncovers new domain model elements
- Discussion about domain model uncovers new use cases
- Use a whiteboard, colored markers, and patience
- Like any notation, domain model diagrams exist to help you think and communicate
- Add comments as appropriate to make meaning clear
- Comment on this slide
Robustness Diagrams
- Re-read the use cases, and classify the objects in the domain model into:
- Boundary objects, which actors use to communicate with the system
- The simulator's input widgets and graphical display
- Entity objects, which are from the core of the domain
- Vehicle types, road segments, etc.
- Control objects, which connect boundary objects to entity objects
- Represent the “intelligence” in the application
- E.g., driver behaviors
- A robustness diagram shows actors' relationships with these objects
![[Robustness Diagram]](img/iconix/iconix_robustness.png)
Figure 27.6: Robustness Diagram |
- The bridge between “what” and “how”
- What's being implemented: use cases and domain model
- How it's being implemented: we'll see in a moment
- Rules make intuitive sense from definitions:
- Actors communicate only with boundary objects
- Boundary objects communicate only with actors and controllers
- Controllers communicate with anything except actors
- Entity objects can only talk to controllers
- These rules are what the graphical notation is for
- Forces us to classify things
- Limits how those things can use each other
- Yes, it's constraining: that's the whole point
- Comment on this slide
Class Diagrams
- A class diagram shows:
- What members and methods each class has
- How classes are derived from each other
- Which classes talk directly to each other
- Either by holding references
- Or by being passed to one another as parameters
- Note:
- Show classes, not individual objects
- Don't show exact nature of relationship (for example, list vs. dictionary)
- Some things are better left to code
- Might seem redundant, given that we already have a domain model diagram
- And given that they use the same notation
- Class diagram (almost) always contains more information than domain model diagram
- Extra objects added to manage relationships between main classes
- What members each class has, what its methods will be called, etc.
- Domain model is for talking to users; class diagram is for talking to developers
- An easy way to spot flaws in a design
- Tangled lines are usually a sign of tangled flow of control
- One big class in the middle that talks to all the others are usually a sign of poor decomposition
- Twenty-seven layers of derivation are usually a sign that you should switch to decaf
- Comment on this slide
Sequence Diagrams
- Last piece of the puzzle is to connect the system's static and dynamic aspects
- A sequence diagram shows what methods objects call to communicate with each other during a particular use case
- Object shown at the top
- Lifespan of method invocation shown by rectangle
- Method invocations shown by arrows
![[Sequence Diagram]](img/iconix/iconix_sequence_diagram.png)
Figure 27.7: Sequence Diagram |
- Note: creating class diagram and sequence diagrams is also iterative
- “Discover” methods as you realize there's no way for X to ask Y to do something
- Sometimes realize that one entity ought to be split into several, or vice versa
- Again, the discussion around the diagrams is as important as the diagrams themselves
- Comment on this slide
Actual Order
- Now that you know what the pieces are, let's put them together
- Phase 1:
- Identify real-world domain objects and their relationships
- Start drawing domain model diagram
- Draw use case diagrams
- Don't write the use cases themselves yet
- Phase 2:
- Start writing use cases
- Create robustness diagram
- Iterate, iterate, iterate
- Update domain model diagram to reflect your discoveries
- Phase 3:
- Draw sequence diagram for each use case
- Create class diagram by combining information from domain model and sequence diagrams
- Phase 4:
- Write code that implements class diagram
- Most important rule: if you change something, update everything upstream from it that refers to it
- E.g., if you add a method while writing code, put it in the class diagram and sequence diagram
- Comment on this slide
A Note on Tools
- Drawing UML diagrams by hand is tedious, error-prone, and counter-productive
- “Counter-productive” because there's no way to automatically check a hand-drawn diagram against your actual code
- Specialized tools to aid in constructing diagrams have been around a long time
- Span the spectrum from:
- Pure modeling tools: help users create diagrams, but don't integrate with code
- Round-trip engineering tools: can translate models into skeletal code, and modified code back into models
- Some integrate with IDEs, support multiple languages, can talk to databases, etc.
ArgoUML is probably the most advanced open source UML tool available
- Most programmers haven't adopted UML modeling tools because:
- The tools are expensive and/or difficult to learn
- If people can't do something useful with a tool in the first hour, they usually stop trying
- But more importantly, they don't find modeling useful
- There's a lot of chicken-and-egg in this
- Until enough programmers create models, and can read them, there's little incentive for any particular programmer to do so
- Comment on this slide
Summary
- Seems like a lot of work, doesn't it?
- But remember:
- You're going to do all of this anyway
- You're just doing it in your head…
- …where it's no use to anyone else
- It takes less total time
- So why don't all large projects use this kind of process?
- Ignorance, laziness, and lack of competitive pressure
- American car makers in the 1970s sold lots of gas-guzzling rust buckets
- Then Volkswagen and Toyota arrived
- It's inappropriate
- Not good at handling rapidly-changing requirements
- Learning curve
- Yes, it'll make you faster on future projects…
- …but learning it will slow you down on this one…
- …and if you don't meet this deadline, there won't be any future projects
- Comment on this slide
The Nevex Process
A Happy Medium
- Very few individuals or teams stick to a textbook process
- A process is just a tool
- You should always adapt tools to local needs…
- …and personalities
- This lecture is a slightly fictionalized account of what Nevex did in 1999-2005
- A computer security startup (acquired by Baltimore, then by Hewlett-Packard)
- Shipped several major versions of a complex product, and many patches
- Met deadlines without lots of late nights and weekends
- Starting point:
- Version 3.2 of GRACE shipped last week
- Version 3.3 has to ship in approximately eight months
- Personnel available are:
- One overall manager
- Nine developers (two junior, four intermediate, three senior)
- One developer support (installers, build, etc.)
- Four quality assurance (QA)
- One technical writer
- One system administrator
- One help desk
- One product manager (marketing, sales support, etc.)
- One secretary/HR/office admin
- Project lifecycle:
![[Project Lifecycle]](img/nevex/nevex_lifecycle.png)
Figure 28.1: Project Lifecycle |
- Goal: meet schedule and quality goals without heroic effort
- No point getting something flaky to the trade show if half the development team quits the week after
- Comment on this slide
All the Vision You'll Ever Need
- Important to have everyone on the same page
- Because common sense isn't
- Exercise: have everyone on the team fill out the following form independently, and compare answers
- Bits in italics are typical answers: have your team members replace them with their answers
| (Problem statement) | The problem of | not being able to access the BIND molecular database remotely |
| (Target market) | affects | scientists studying protein interactions. |
| (Impact) | The impact is that | they can only inspect/update single records at a time via a browser. |
| (Solution) | Our solution is to | let programmers search and update the database programmatically using multiple languages. |
| (Specific audience) | For | programmers |
| who want to | incorporate BIND into a larger application |
| (Technical details) | our product is a | web services API based on SOAP |
| that | supports remote calls into the core BIND API. |
| (Competitive comparison) | Unlike | screen-scraping |
| our product | can be used directly and reliably in many languages. |
- Comment on this slide
Analysis & Estimation
- “Plans are useless, but planning is invaluable.” (Dwight D. Eisenhower)
- Product manager gathers requirements from customers
- Customers often say, “We want feature X.”
- Often means, “I heard someone mention this while I was waiting for a flight.”
- Product manager turns that into, “We want the product to be able to do Y.”
- What do customers want to be able to do that they can't right now?
- What does the product make them do that they don't want to have to?
- Developers analyze the feature requests provided by the product manager
- How can the capability be implemented?
- Often involves researching new technologies, competitors' products, etc.
- What are the risks and unknowns?
- Best way to find out is often to build some throwaway code
- This goes under version control like anything else…
- …but should not then be “grown” into the product
- How long will it take?
- We'll return to this topic below
- Usually takes 3-6 weeks to do A&E
- What's everyone else doing while this is going on?
- Providing estimates of their own, of course
- Every feature must be tested and documented
- Must also be added to the daily build, and to the installer
- So QA, documentation, and developer support need to add time estimates to the A&E too
- When all A&E's are done, the product manager collects the estimates
- Total time is (for example) 9000 work-days
- But only 2000 work-days are available
- Fifteen people in development, QA, tech writing, and technical support
- Eight months of five days a week, minus statutory holidays
- Minus 20% for sick days, emergency bug fixes for existing customers, etc.
- No team ever works at 100% capacity, so don't schedule that way!
- So now the product manager has to prioritize
- Which features are most cost-effective to develop?
- I.e., will satisfy the most customer needs for the least effort?
- Developers don't know enough (don't have enough customer contact) to make this decision
- Usual way to do this is to build a 3×3 grid
- Rank each feature low-medium-high on importance and time required
- More honest than the false accuracy of a 1-10 scale
![[Feature Rankings]](img/nevex/feature_ranking.png)
Figure 28.2: Feature Rankings |
- Product manager is not allowed to shave developers' time estimates
- If she is, developers will start padding their estimates…
- …or supplying random numbers, secure in the knowledge that they won't be able make the deadline anyway
- Letting managers make up “science fiction schedules” is one of the most common mistakes companies make
- The other is not doing enough testing early on
- Yes, customers (and your marketing department) will complain if their feature doesn't make it onto the schedule
- But putting it in the schedule when you know that schedule is fiction doesn't improve the chances of it getting implemented on time
- It's better to live up to small promises than break big ones
- End result of A&E process is a schedule that everyone believes is achievable
- Yes, it contains a lot of guesswork…
- …but it's better than nothing…
- …and estimates improve with practice
- Comment on this slide
What Goes Into an A&E
- No template is ideal for all A&E's
- Every A&E is different, because every problem is different
- Requiring team members to express their ideas in a fixed way just leads to fewer ideas
- Typical A&E has these sections:
- Title, author(s), revision history
- Executive summary (abstract)
- One paragraph
- Tells someone browsing through dozens of A&E's whether this is one they should read
- Background
- Aim is to provide a context for evaluating decisions
- Introduce the problem to a reasonably knowledgeable developer
- Don't explain what the Internet is…
- Imagine writing for who you were when you started the project
- User stories
- What will people be able to do (or do better) when this feature is implemented?
- Often the starting point from which QA develops test cases, and the technical writer develops documentation
- Often find there's an easy way to give customers half of what they want, and a harder way to do it all
- Estimate both, and discuss pros and cons…
- …because the product manager may decide to go with the cheap 50% solution when putting the schedule together
- “Not everything worth doing is worth doing well.”
- For each alternative:
- How long will it take to create? To test? To document? To add to the build and installer?
- What impact will it have on other features?
- How risky is it (i.e., how certain are you of your estimate)?
- Do not gloss over the hard bits
- This is when you do your hard thinking (while you're still fresh)!
- “We parse the XML configuration file, and then a miracle happens” is not a good A&E
- References
- These days, usually a list of URLs
- Include pointers to prototypes stored in the version control repository
- Time estimates come from experience
- You can usually estimate how much code you'll have to write to implement something
- If you can't, you should go back and think it through in more detail
- You should know how long it takes you to develop working code
- Just as runners keep track of their times for running 1000 meters, or how much they can bench press
- The most elaborate version of this is Watts Humphrey's Personal Software Process (PSP)
- In practice, enough to keep track of how many days you spend on each module you write…
- …where “days spent” includes debugging time
- Your first estimates will be wildly wrong
- But the more you estimate, the better your estimates will become
- Expect to revise this document several times
- Project manager assigns task of reviewing A&E to specific developer(s)
- Ensures that reviews actually get done…
- …and that knowledge and decisions are communicated
- Comment on this slide
Getting It Built
- Developers then finish designs
- Purpose is to convince people (including themselves) that they can actually build things the way they intend to
- Very easy to believe incompatible things when it's all in your head
- The act of writing it down (explaining it) forces brings these issues out into the open…
- …and ensures that other people know what you think they think you think
- Also helps you remember later why you're doing something a particular way
- Note that design overlaps analysis
- After all, it's impossible to estimate how long it will take to build something if you don't know how you're going to build it
- Level of detail varies widely
- The better developers understand the product and the technologies it uses, the less risk there is of a lightweight design leading to nasty surprises
- The greater the cost of delivering the wrong thing, the more detailed the design should be
- But it doesn't help to do a detailed design that customers can't understand
- Up-front design is a tool for managing risk
- The later you change or fix something, the more expensive it is
- But the more time you invest up front, the longer it is until you discover that you need to make changes
- Write the code
- Also overlaps analysis and design
- Build throwaway prototypes while doing analysis
- Expect to refine design during early stages of construction
- If you're still refining the design a week before you're due to ship, something has gone wrong
- Remember that coding includes writing unit tests
- These are the developer's responsibility
- Include coverage analysis in the daily build & test to ensure that new code is being tested
- In theory, QA should refuse to test features that aren't exercised by unit tests
- Often refactor old code while adding new stuff
- Your skills (and coding style) improve over time
- Or the person working on the feature in Version 3.3 knows something the Version 3.2 author didn't
- The problem changes over time
- A good solution to last year's requirements may not be a good solution to this year's
- As features are completed, update the design to reflect what was actually built
- If you don't, then:
- The next person who has to work with the code will have a harder time
- You'll have to start from further behind the next time you need to modify this section
- Making sure this happens is the project manager's job
- Maintenance usually starts before the product goes out the door
- Fix bugs left over from the previous cycle while adding features in this one
- Start fixing bugs in new features as QA discovers them
- Last quarter of a project cycle is usually nothing but bug fixing
- Comment on this slide
Other Activities
- Quality assurance starts at the same time as development
- As soon as feature specifications are firm, QA can:
- Start making up lists of tests to run
- Start writing utility programs
- And of course, there's always bug fixes to the previous version to test…
- QA's usual role is to find bugs in the finished product
- Yes, developers should do this as they're coding…
- …but in practice, they are usually too focused on particular features to find wider issues
- Developers also tend not to be good at testing
- Subconsciously don't want to find problems
- Focus on the parts they're most familiar with, rather than the product as a whole
- QA's real job is to figure out why bugs are happening
- “Trying to improve the quality of software by doing more testing is like trying to lose weight by weighing yourself more often.” (Steve McConnell)
- So that developers stop making that kind of mistake
- Documentation also starts before product is available
- Describe new GUI screens based on mock-ups drawn for A&E
- Write new on-line help
- If the technical writer can't figure out how to explain a feature, the developer should redesign it
- Developer support starts work on installers the same day developers start on new features
- Makes sure the daily build & test keeps running
- And chastises developers who've broken things
- Everyone has a status report in their mailbox when they arrive each morning
- Schedule is always up to date
- Developers provide progress reports every week (at least)
- Doing this at 9:00 a.m. Monday works better than asking for it at 4:45 on Friday
- Always based on binary milestones
- If you allow percentages, then many tasks will be 90% done for 90% of the lifetime of the project
- Instead, break tasks down into subtasks that are at most a few days long, and either completed or not completed
- Describe milestones in terms of verifiable deliverables (i.e., things that someone else can check)
- The real purpose of a schedule is to help the project manager cut corners when things go wrong
- How far off track are we?
- What can we cut (or scale back) to get on track?
- Comment on this slide
After the Party's Over
- The last thing you do is write the release log and assign the release a version number
- Leave this until the end because:
- You may discover that what you thought was going to be one release actually needs to be several
- Marketing may change its mind based on what the competition is calling their releases
- Refer to the release by a codename (like “Beryllium” or “Munich”) while it's under development
- Always do a post mortem on the project after shipping
- What went right (that we want to do again)?
- What went wrong (that we want to avoid next time)?
- Usually best to bring in a neutral outsider to facilitate
- Feedback is only as useful as it is honest
- And then start the next round of A&E's
- Because while the development team has been wrapping up this release, the product manager has been preparing for the next one
- Comment on this slide
Exercises
Exercise 28.1:
Does your manager know when you expect to complete your
current task? How inaccurate the schedule currently is?
Exercise 28.2:
Can you find out when your manager expects you to complete
your current task (without asking her directly)? When team members
expect to complete their current tasks (without asking them directly)?
Who would be affected if you slipped a week?
Backward, Forward, and Sideways
And Now We May Begin
- From one perspective, this course has just scratched the surface
- From another, it has introduced you to the 20% of software engineering that accounts for 80% of what professional developers do
- Techniques
- Supported by tools
- In this lecture, look at a few of the next steps you might want to take
- Comment on this slide
Numerical Programming
- In the beginning there was Fortran
- A great leap forward
- But not really mathematics
- Then, in the late 1960s, APL
- Started as a notation for linear algebra
- Turned out it could be implemented on a computer
- “Programming in mathematics”
- If you read Greek backwards
- See
http://www.users.cloud9.net/~bradmcc/APL.html for the scary details
- MATLAB emerged in the early 1980s as a scripting tool for BLAS (the Basic Linear Algebra Subroutines)
- High-level language for pushing Fortran arrays around
- Graphics, OO, and many other capabilities added over the years
- Python's equivalent is the
Numeric module (sometimes referred to as NumPy)
- A high-performance C library wrapped in a high-level interface
- Core package provides efficient multidimensional arrays
- Much slower than compiled C or Fortran…
- …but much easier to write…
- …and much faster than pure Python
- Add-ons provide random number generation, FFTs, and linear algebra
- Forms the basis of other packages
matplotlib for MATLAB-style plotting- PIL (Python Imaging Library) for image processing
- NumPy is being replaced with a new
numarray module- Extra hands always welcome
- Also worth looking at
SciPy
- A collection of scientific utilities built on top of Numpy
- Comment on this slide
Imaging
- The
Python Imaging Library is a toolkit for working with images in just about every format you'll ever encounter- Scale, crop, overlay, merge, etc.
- Open source, but commercial support is available
- Example: reduce the amount of green in an image, but only where there isn't a lot of red
import sys, Image
# Color bands in image.
RED, GREEN, BLUE = 0, 1, 2
# Read image.
im = Image.open(sys.argv[1])
# Split the image into individual bands.
source = im.split()
# Create a bitmask of regions where there's red.
def lessThan100(x):
if x > 128:
return x
else:
return 0
mask = source[RED].point(lessThan100)
# Process the green band.
def reduceByHalf(x):
return x * 0.75
out = source[GREEN].point(reduceByHalf)
# Paste the processed band back, but only where red was < 100.
source[GREEN].paste(out, None, mask)
# Build a new multiband image.
im = Image.merge(im.mode, source)
# Save for display.
im.save(sys.argv[2])
![[Original Image]](img/summary/pil_input.jpg)
Figure 29.1: Original Image |
![[Processed Image]](img/summary/pil_output.jpg)
Figure 29.2: Processed Image |
- Note: if you're looking for a professional-quality image editing tool, try
GIMP
- It's complex, but also powerful
- Comment on this slide
A Better Way to Build
- Said back in Automated Builds that Make has turned into a small programming language
- A very clumsy one at that: confusing syntax, no debugger, …
- The same thing happens eventually to every other build management tool
- So why not start with a real programming language, and embed a build management tool in that?
SCons combines the most useful features of Make with the full power of Python- Instead of a Makefile, you write an SConstruct file
- Use function calls in that file to tell SCons what you want to build, and how
- Like Make, it has a rich set of default rules
- But when you want to do something complicated, you can use any feature of Python you want
- E.g., build up a list of filenames, fetch a build rule from a database, etc.
- Example: build either a normal or a debugging version of a program, and include an extra source file on Windows
# What does the program depend on?
dependencies = ['file1.c', 'file2.c']
if os.platform == 'win32':
dependencies.append('win32.c')
# Which version are we building?
if 'debug' in COMMAND_LINE_TARGETS:
Program('hello_dbg', dependencies)
else:
Program('hello', dependencies)
- Pro
- Don't have to learn a new language
- And the authors of the build tool don't have to create and maintain one, either
- There's a debugger
- Con
- Much less widely used than Make or Ant (which is the Java equivalent of Make)
- Comment on this slide
Integrating with C
- Been saying since the first lecture that you should:
- Write the first version in Python (or some other very high-level language)
- Find out whether it's fast enough, and if not, what's slowing it down
- Optimize only those parts that you need to
- The most effective way to optimize programs written in high-level languages like Python and MATLAB is often to rewrite core modules in C (or some other low-level language)
- Also a good way to handle inherited code: wrapping tried-and-trusted C or Fortran in Python is safer and easier than rewriting
- And faster
- If you don't speak C,
[Kernighan & Ritchie 1998]
is the standard introduction
- Python represents things using a C structure of type
PyObject that contains:
- A type code (so that the interpreter knows how to interpret the rest of the structure)
- A union large enough to hold any basic value
- In Python's case, the two 64-bit values needed to store a complex number
- A reference count to keep track of how many other objects are pointing at this thing
- When a thing is created, its reference count is initialized to 1
- When its count drops to 0, Python can garbage collect it
![[How Python Represents Values]](img/summary/pyobject.png)
Figure 29.3: How Python Represents Values |
- Every C function that the interpreter calls must take two arguments:
self is NULL for pure functions, and an object for methodsargs is a variable-length list of arguments- Use the function
PyArg_ParseTuple to extract arguments' values - I.e., get the raw integer out of the first, a string pointer out of the second, etc.
- Example: take an integer and return three times its value
- Not particularly exciting, but it illustrates the main points
/* Triple an integer value. */
static PyObject * triple(PyObject * self, PyObject * args)
{
int val;
if (!PyArg_ParseTuple(args, "i", &val)) {
return NULL;
}
val = val * 3;
return Py_BuildValue("i", val);
}
- Returns
NULL to signal error - Otherwise, uses
Py_BuildValue to build a Python structure with the result value
- Need some boilerplate to bring this function to the Python interpreter's attention
/* Table of module contents (handed back to Python at initialization). */
static PyMethodDef contents[] = {
{"triple", triple, METH_VARARGS},
{NULL, NULL}
};
/* Initialization function. */
void inittriple()
{
Py_InitModule("triple", contents);
}
- The array
contents has one entry for each function - The initialization function:
- Has a name of the form
initXYZ for the module XYZ - Calls
Py_InitModule to pass the table of module contents to the interpreter
- To use:
- Compile the C to create a shared library (
.dll on Windows, .so on Unix) - Copy the result to a directory that's on Python's search path
- Then
import triple and call triple.triple(11)
- Comment on this slide
Integrating the Other Way
- It's simpler with SWIG
- Simple Wrapper Interface Generator
- Interprets specially-formatted comments in C header files to generate Python wrappers
- Similar tools exist to connect Python to Fortran
- Can also go the other way, and embed Python in C/C++
- Every large application eventually needs an interactive command interpreter
- To embed:
- Initialize a Python interpreter object
- Convert application values into Python objects
- Pass the interpreter a string containing the code to be executed, and the values to execute it on
- Unwrap the result
- Multilanguage programming isn't simple
- And multilanguage debugging is downright hard
- But both are often simpler (as well as more efficient) than the alternatives
- Comment on this slide
Design Patterns
- A design pattern is a way of doing things
- E.g., porches, locked-room mysteries, and nested loops
- Too fuzzy to put in a library, but a great way to talk about design issues
- The architect Christopher Alexander invented the term
-
[Gamma et al 1995]
applied the notion to object-oriented programming
- Now dozens of books, conferences, etc.
- Easiest way to explain is by example
- Singleton
- Synopsis: ensure that only one instance of a class is created
- Context: want every object that uses a class to use the same instance of that class
- E.g., the graphics driver for the screen
- Forces:
- There must be exactly one instance of the class
- All objects that use the class must have access to that instance
- Solution:
- Make the constructor private so that other classes can't get at it
- Provide a static method for construction
- Have that method keep a copy of the first object it creates, and re-use it on subsequent calls
- Filter
- Synopsis: dynamically connect objects to transform a data stream
- Context: some computations most easily expressed as a sequence of transformations
- E.g., image processing, or Unix commands joined by pipes
- Allows uses to mix and match operations easily
- Forces:
- Every filter should read and write the same format, so that they can be chained together
- It should be possible to chain filters together in arbitrary orders
- Solution:
- Have each filter implement an interface that reads and writes data
- Define special cases for sources (that only write) and sinks (that only read)
- Provide filters with a way to signal that they have nothing more to produce
- Visitor
- Synopsis: separate traversal of a complex structure from operations on that structure
- Context: want an easy way to do walk around a complex structure
- E.g., visit each node in a graph once
- Forces:
- Many different operations may need to be performed
- Structure is complex enough that visiting elements is error-prone
- The types of objects in the structure, and the ways they are connected, change infrequently
- Solution #1: create an iterator class with do-nothing methods that are called at specific times
- This class takes care of things like remembering which nodes have already been seen
- Users derive their own iterators from this, filling in methods to do what they want
- Solution #2: create one iterator class that takes a callback argument
- Callback only provides methods that do things
- Iterator calls these at appropriate times
- Solution #1 is simpler, but:
- Some languages (like C) don't support derivation, so you have to use solution #2
- Solution #2 allows you to re-use callbacks with different kinds of collections
- Not just about object-oriented design
- User interface patterns
- Business patterns
- Anti-patterns (things to avoid)
- In practice, serve two purposes:
- Communication: a concise way for designers to communicate with each other
- And argue out exactly what they mean
- Education: gives them a way to communicate what they know to newcomers
- Comment on this slide
Refactoring
- Refactoring is the process of changing (hopefully, improving) the design of existing code
- Clean it up to make changes easier (or possible)
- Term popularized by Martin Fowler
- Often described in terms of “code smells” and corresponding cures
- Equivalent to an anti-pattern and an algorithm for fixing it
-
[Fowler 1999]
is a catalog of refactorings
- Design patterns describe code at rest; refactorings describe code in motion
- Smell: method or function body runs to several pages
- Cure: use Extract Method to break the method into smaller meaningful pieces
- Do not break up arbitrarily, just to satisfy coding conventions
- Each new method should make sense on its own
- Smell: function or method with many parameters
- If a method takes eleven strings as parameters, sooner or later you'll pass them in the wrong order
- Cure: store values as members of the object, rather than passing them as parameters
- But it's bad to introduce members called
param1 and param2
- Alternative cure: use the
Introduce Parameter Object to combine co-occurring parameters into one value
- Smell: duplicated code
- Cure: use Extract Method once again, of course
- But what if code is only almost duplicated?
- Use Introduce Parameter to give callers a way to signal exactly what they want…
- …or use Pull Up Method to move shared code into parent…
- …and Form Template Method to have that shared code call something that each child class defines
- Smell: complex Boolean expressions in conditionals
- Cure: Introduce Explaining Variable to give sub-expressions meaningful names
- Particularly effective when it's used to turn nested if-then-elses into a lookup table
- Same benefits as patterns: communication shorthand, and education
- But there's one big difference: it's possible to build tools to do refactoring for you
- The Eclipse IDE for Java is the best known example
- Highlight a method, say “Rename”, and the IDE changes all the calls too
- Able to do this because it continuously re-parses source code as you type
- Move methods up into parent class, split classes in two, and much, much more
- If this doesn't convince you to upgrade from a dumb editor, I don't know what will…
- Very important to have unit tests in place before starting to refactor
- Without this, you have no way of knowing what else your refactoring might have broken
-
[Feathers 2005]
is an excellent guide to how to fit useful tests back onto inherited applications
- Comment on this slide
And a Little Light Reading
-
[Brand 1995]
looks at how buildings can be designed to change and grow gracefully over time
- Everything he says is directly applicable to large programs
-
[Steele 1999]
shows you how a great computer scientist thinks about a particularly hard problem
-
[Margolis & Fisher 2002]
describes work at Carnegie-Mellon University aimed at making computer science a less hostile environment for women and other underrepresented groups
- Note: while the gender ratio in computing as a whole is about six to one, the ratio in open source is closer to 200:1
Internet Groupware for Scientific Collaboration talks about how much more the web could do for scientists- Comment on this slide
The Rules
- A week of hard work can sometimes save you an hour of thought.
- If it's worth doing again, it's worth automating.
- Anything repeated in two or more places will eventually be wrong in at least one.
- The three chief virtues of a programmer are laziness, impatience, and hubris.
- It's not what you know, it's what you can.
- The deadline isn't when you're supposed to finish; the deadline is when it starts to be late.
- Never debug standing up.
- Tools are signposts, not destinations.
- Not everything worth doing is worth doing well.
- Code unto others as you would have others code unto you.
- Every complex file format eventually turns into a badly-designed programming language.
- Tools are amplifiers: they allow good programmers to be better, and bad ones to be worse.
- They call it computer science because it's experimental.
- Programs come and go; data is forever.
- There's no such thing as one program.
- Discipline matters more than genius.
- Comment on this slide
Conclusion
- “A good teacher has a value above pearls, but a good student has a value above rubies.”
- Thank you, and good luck
- Comment on this slide
Bibliography
 | [Agans 2002] | David J. Agans:
Debugging.
American Management Association,
2002,
0814471684.
|
|
Its first sentence says, “This book tells you how to find out
what's wrong with stuff, quick,” and that's exactly what it
does. In fifteen (very) short chapters, the author presents nine
simple rules to help you track down and fix problems in software,
hardware, or anything else. His war stories are entertaining
(although I think one or two are urban myths), and his advice is
eminently practical.
|
 | [Brand 1995] | Stewart Brand:
How Buildings Learn.
Penguin USA,
1995,
0140139966.
|
|
This beautiful, thought-provoking book starts with the observation
that most architects spend their time re-working or extending
existing buildings, rather than creating new ones from scratch. Of
course, if Brand had written “program” instead of
“building”, and “programmer” where he'd
written “architect”, everything he said would have been
true of computing as well. A lot of software engineering books try
to convey the same message about allowing for change, but few do it
so successfully. By presenting examples ranging from the MIT Media
Lab to a one-room extension to a house, Brand encourages us to see
patterns in the way buildings change (or, to adopt Brand's metaphor,
the way buildings learn from their environment and from
use). Concurrently, he uses those insights to argue that since
buildings are always going to be modified, they should be designed
to accommodate unanticipated change.
|
 | [Castro 2002] | Elizabeth Castro:
HTML for the World Wide Web.
Peachpit Press,
2000,
0321130073.
|
|
A clean, clear, comprehensive guide to creating HTML for the web,
with good coverage of Cascading Style Sheets (CSS).
|
 | [Castro 2000] | Elizabeth Castro:
XML for the World Wide Web.
Peachpit Press,
2000,
0201710986.
|
|
Like other books in Peachpit's Visual Quickstart series, this one is
beautifully designed, and easy to read without ever being
condescending. Its 16 chapters and 4 appendices are organized into
1- and 2-page explanations of particular topics, from writing
non-empty elements to namespaces, schemas, and XML
transformation. Throughout, Castro strikes a perfect balance between
“what”, “why”, and “how”, and
provides a surprising amount of detail without ever overwhelming the
reader.
|
 | [Chase & Simon 1973] | W.G. Chase and H.A. Simon:
"Perception in chess",
Cognitive Psychology,
vol. 4,
no. ,
pp. 55-81,
1973.
|
|
The original paper comparing the performance of novice and master
chess players when confronted with actual and random positions.
|
 | [Clark 2004] | Mike Clark:
Pragmatic Project Automation.
Pragmatic Bookshelf,
2004,
0974514039.
|
|
This entry in the Pragmatic Bookshelf series focuses on getting your
project to build itself, and (more importantly) tell you how the
build went, automatically. Clark doesn't confine himself to running
Make at 3:00 a.m.; he also covers ways of automatically re-running
tests, building and testing installers, monitoring applications, and
more.
|
 | [DeMarco & Lister 1999] | Tom DeMarco and Timothy Lister:
Peopleware.
Dorset House,
1999,
0932633439.
|
|
This was the first book I ever read that said that the leading cause
of software project failure was people, rather than technology.
Using anecdotes, humor, and common sense, DeMarco and Lister explain
how important good physical space, aligning authority with
responsibility, and clear direction are.
|
 | [Fehily 2003] | Chris Fehily:
SQL.
Peachpit Press,
2003,
0321118030.
|
|
This very readable book describes the 5% of SQL that covers 95% of
real-world needs. While the book moves a little slowly in some
places, the examples are exceptionally clear.
|
 | [Feldman 1979] | Stuart I. Feldman:
"Make---A Program for Maintaining Computer Programs",
Software: Practice and Experience,
vol. 9,
no. 4,
pp. 255-265,
1979.
|
|
The original description of Make. Last time I checked, Stu Feldman
was a vice president at IBM, which shows you just how far a good
tool can take you…
|
 | [Feathers 2005] | Michael C. Feathers:
Working Effectively with Legacy Code.
Prentice-Hall PTR,
2005,
0131177052.
|
|
Most programmers spend most of their time fixing bugs, porting to
new platforms, adding new features---in short, changing existing
code. If that code is exercised by unit tests, then changes can
be made quickly and safely; if it isn't, they can't, so your first
job when you inherit legacy code should be to write some. That's
where this book comes in. What to know three different ways to
inject a test into a C++ class without changing the code? They're
here. Want to know which classes or methods to focus testing on?
Read his discussion of pinch points. Need to break inter-class
dependencies in Java so that you can test one module without
having to configure the entire application? That's in here too,
along with dozens of other useful bits of information. Everything
is illustrated with small examples, all of them clearly explained
and to the point. There are lots of simple diagrams, and a short
glossary; all that's missing is hype.
|
 | [Fowler 1999] | Martin Fowler:
Refactoring.
Addison-Wesley Professional,
1999,
0201485672.
|
|
Like architects, most programmers spend most of their time
renovating, rather than creating something completely new on a blank
sheet of paper. This book presents and analyzes patterns that come
up again and again when programs are being reorganized. Some of
these are well-known, such as placing common code in a utility
method. Others, such as replacing temporary objects with queries, or
replacing constructors with factory methods, are subtler, but no
less important. Each entry includes a section on motivation, the
mechanics of actually carrying out the transformation, and an
example in Java.
|
 | [Friedl 2002] | Jeffrey E. F. Friedl:
Mastering Regular Expressions.
O'Reilly,
2002,
0596002890.
|
|
The definitive programmer's guide to regular expressions.
|
 | [Gamma et al 1995] | Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides:
Design Patterns.
Addison-Wesley,
1995,
0201633612.
|
|
The book that started the software design patterns movement. Much
of the discussion has been superseded by more recent books, and the
use of C++ and Smalltalk for examples feels a little dated, but it
is still a landmark in programming.
|
 | [Glass 2002] | Robert L. Glass:
Facts and Fallacies of Software Engineering.
Addison-Wesley Professional,
2002,
0321117425.
|
|
I really wish someone had given me something like this book when I
took my first programming job. If nothing else, it would have been a
better way to start thinking about the profession I had stumbled
into than the “everybody knows” factoids that I soaked
up at coffee time. Some of what he says is well-known: good
programmers are up to N times better than bad ones (his value for N
is 28), reusable components are three times harder to build than
non-reusable ones, and so on. Other facts aren't part of the
zeitgeist, though they should be. For example, most of us know that
maintenance consumes 40-80% of software costs, but did you know that
roughly 60% of that is enhancements, rather than bug fixes? Or that
if more than 20-25% of a component has to be modified, it is more
efficient to re-write it from scratch? Best of all, Glass backs up
every statement he makes with copious references to the primary
literature; if you still disagree with him, you'd better be sure you
have as much evidence for your point of view as he has for his.
|
 | [Goerzen 2004] | John Goerzen:
Foundations of Python Network Programming.
APress,
2004,
1590593715.
|
|
This book looks at how to handle several common protocols, including
HTTP, SMTP, and FTP. Goerzen also doesn't delve as deeply into their
internals, but instead on how to build clients that use them. His
approach is to build solutions to complex problems one step at a
time, explaining each addition or modification along the way. He
occasionally assumes more background knowledge than most readers of
this book are likely to have, but only occasionally, and makes up
for it by providing both clear code, and clear explanations of why
this particular function has to do things in a particular order, or
why that one really ought to be multithreaded.
|
 | [Good 2005] | Nathan A. Good:
Regular Expression Recipes.
APress,
2005,
159059441X.
|
|
A great how-to for regular expressions, with examples in many
different languages.
|
 | [Gunderloy 2004] | Mike Gunderloy:
Coder to Developer.
Sybex,
2004,
078214327X.
|
|
This practical, readable book is subtitled “Tools and
Strategies for Delivering Your Software”, and that's
exactly what it's about. Project planning, source code control,
unit testing, logging, and build management are all
there. Importantly, so are newer topics, like building plugins
for your IDE, code generation, and things you can do to protect
your intellectual property. Everything is clearly explained, and
illustrated with well-chosen examples. While the focus is
definitely on .NET, Gunderloy covers a wide range of other
technologies, both proprietary and open source. I'm already
using two new tools based on references from this book, and plan
to make the chapter on “Working with Small Teams”
required reading for my students.
|
 | [Harold 2004] | Elliotte Rusty Harold:
Effective XML.
Addison-Wesley,
2004,
0321150406.
|
|
This book explains which of XML's many features should be used when:
Item 12 tells you to store metadata in attributes, and then spends
six pages explaining why, while Item 24 analyzes the strengths and
weaknesses of various schema languages, and Item 38 covers character
set encodings. It's more than most developers will ever want to
know, but when you need it, you really need it.
|
 | [Hock 2004] | Roger R. Hock:
Forty Studies that Changed Psychology.
Prentice Hall,
2004,
0131147293.
|
|
In forty short chapters, Hock describes the turning points in
our understanding of how our minds work. The book isn't just
about psychology; you'll also learn a lot about how science gets
done, and about the scientists who do it.
|
 | [Hunt & Thomas 1999] | Andrew Hunt and David Thomas:
The Pragmatic Programmer.
Addison-Wesley,
1999,
020161622X.
|
|
This book is about those things that make up the difference between
typing in code that compiles, and writing software that reliably
does what it's supposed to. Topics range from gathering requirements
through design, to the mechanics of coding, testing, and delivering
a finished product. The second section, for example, covers
“The Evils of Duplication”, “Orthogonality”,
“Reversibility”, “Tracer Bullets”,
“Prototypes and Post-It Notes”, and “Domain
Languages”, and illuminates each with plenty of examples and
short exercises.
|
 | [Johnson 2000] | Jeff Johnson:
GUI Bloopers.
Morgan Kaufmann,
2000,
1558605827.
|
|
Most books on GUI design are long on well-meaning aesthetic
principles, but short on examples of what it means to put those
principles into practice. In contrast, GUI Bloopers presents case
study after case study: what's wrong with this dialog? What should
its creators have done instead. And, most importantly, why? The net
effect is to teach all of the same principles that other books try
to, but in a grounded, understandable way.
|
 | [Kernighan & Pike 1984] | Brian W. Kernighan and Rob Pike:
The Unix Programming Environment.
Prentice Hall,
1984,
013937681X.
|
|
I have long believed that this book is the real secret to Unix's
success. It doesn't just show readers how to use Unix---it explains
why the operating system is built that way, and how its "lots
of little tools" philosophy keeps simple tasks simple, while making
hard ones doable.
|
 | [Kernighan & Ritchie 1998] | Brian W. Kernighan and Dennis Ritchie:
The C Programming Language.
Prentice Hall PTR,
1998,
0131103628.
|
|
The classic description of the one programming language every
serious programmer absolutely, positively has to learn.
|
 | [Knuth 1998] | Donald E. Knuth:
The Art of Programming.
Addison-Wesley,
1998,
0201485419.
|
|
The lifework of the man who invented many of the basic concepts of
algorithm analysis, these massive tomes are like Everest:
awe-inspiring, but not for the weak of heart. Most readers will
find
[Sedgewick 2001]
much more approachable.
|
 | [Langtangen 2004] | Hans P. Langtangen:
Python Scripting for Computational Science.
Springer-Verlag,
2004,
3540435085.
|
|
The book's aim is to show scientists and engineers with little
formal training in programming how Python can make their lives
better. Regular expressions, numerical arrays, persistence, the
basics of GUI and web programming, interfacing to C, C++, and
Fortran: it's all here, along with hundreds of short example
programs. Some readers may be intimidated by the book's weight, and
the dense page layout, but what really made me blink was that I
didn't find a single typo or error. It's a great achievement, and a
great resource for anyone doing scientific programming.
|
 | [Lutz & Ascher 2003] | Mark Lutz and David Ascher:
Learning Python.
O'Reilly,
2003,
0596002815.
|
|
This is not only the best introduction to Python on the market, it
is one of the best introductions to any programming language that I
have ever read. Lutz and Ascher cover the entire core of the
language, and enough of its advanced features and libraries to give
readers a feeling for just how powerful Python is. In keeping with
the spirit of the language itself, their writing is clear, their
explanations lucid, and their examples well chosen.
|
 | [Margolis & Fisher 2002] | Jane Margolis and Allan Fisher:
Unlocking the Clubhouse.
MIT Press,
2002,
0262133989.
|
|
This book describes a project at Carnegie-Mellon University that
tried to figure out why so few women become programmers, and what
can be done to correct the imbalance. Its first six chapters
describe the many small ways in which we are all, male and female,
are conditioned to believe that computers are "boy's things".
Sometimes it's as simple as putting the computer in the boy's
room, because "he's the one who uses it most". Later on, the "who
needs a social life?" atmosphere of undergraduate computer labs
drives many women away (and many men, too). The last two chapters
describe what the authors have done to remedy the situation at
high schools and university. This work proves that by being
conscious of the many things that turn women off computing, and by
viewing computer science from different angles, we can attract a
broader cross-section of society, which can only make our
discipline a better place to be. The results are impressive:
female undergraduate enrolment at CMU rose by more than a factor of
four during their work, while the proportion of women dropping out
decreased significantly.
|
 | [Martelli 2005] | Alex Martelli, Anna Ravenscroft, and David Ascher:
Python Cookbook.
O'Reilly,
2005,
0596007973.
|
|
A useful reference for every serious Python programmer, this book is
a collection of tips and tricks, some very simple, others so complex
that they require careful line-by-line reading. The book's companion
web site is updated regularly.
|
 | [Mason 2005] | Mike Mason:
Pragmatic Version Control Using Subversion.
Pragmatic Bookshelf,
2005,
0974514063.
|
|
Yet another book from the folks at Pragmatic, this one is everything
you'll ever need to know about Subversion, which is on its way to
becoming the version control system of choice for open source
development.
|
 | [McConnell 2004] | Steve McConnell:
Code Complete.
Microsoft Press,
2004,
0735619670.
|
|
This classic is a handbook of do's and don'ts for working
programmers. It covers everything from how to avoid common mistakes
in C to how to set up a testing framework, how to organize
multi-platform builds, and how to coordinate the members of a team.
In short, it is everything I wished someone had told me
before I started my first full-time programming job.
|
 | [McConnell 1996] | Steve McConnell:
Rapid Development.
Microsoft Press,
1996,
1556159005.
|
|
This book describes what it takes to develop robust code quickly,
what mistakes are often made in the name of rapid development, and
how to identify and analyze potential risks. It includes a list of
25 best practices, and discusses things that most other books leave
out (like recovering from disasters and dealing with impossible
demands). Unlike most “how to do it better” books, it
isn't try to sell any particular practice or style, which adds even
more weight to McConnell's carefully balanced opinions.
|
 | [McConnell 1997] | Steve McConnell:
Software Project Survival Guide.
Microsoft Press,
1997,
1572316217.
|
|
A condensed manager-level version of the same author's Rapid Development.
|
 | [Pilgrim 2004] | Mark Pilgrim:
Dive Into Python.
APress,
2004,
1590593561.
|
A good introduction to Python, which is also available on-line at
Dive Into Python.
|
 | [Powazek 2001] | Derek M. Powazek:
Design for Community.
New Riders,
2001,
0735710759.
|
|
This book isn't about web logging, streaming video, or managing
mailing lists. Instead, it's about how to structure web sites so
that they will foster on-line communities. The writing is personal
without being sappy or overbearing, and the author draws upon a
wealth of personal experience to explain why you sometimes don't
want to make it easy for people to post comments, or how best to
deal with abusive posters. There's a lot of no-nonsense analysis of
the cost of interactivity, and interviews with the creators of some
of the web's most successful community sites.
|
 | [Ray & Ray 2003] | Deborah S. Ray and Eric J. Ray:
Unix.
Peachpit Press,
2003,
0321170105.
|
|
A gentle introduction to Unix, with many examples.
|
 | [Rosenberg & Stephens 2005] | Doug Rosenberg and Matt Stephens:
Use Case Driven Object Modeling with UML.
Addison-Wesley,
2005,
0321278275.
|
|
An update of Rosenberg's 1999 book of the same name; in just eight
chapters, the authors present a slimmed-down core of UML organized
around a four-stage design process. Each stage has clearly defined
steps, and concrete milestones which specify what ought to be
produced (i.e., how to tell when you're finished).
|
 | [Scanlan 1989] | David A. Scanlan:
"Structured Flowcharts Outperform Pseudocode: An Experimental Comparison",
IEEE Software,
vol. 6,
no. 5,
pp. 28-36,
1989.
|
|
Describes an experimental comparison of pseudocode with
equivalently-structured flowcharts, in which flowcharts did
much better than in earlier studies.
|
 | [Schneier 2003] | Bruce Schneier:
Beyond Fear.
Springer,
2003,
0387026207.
|
|
A thought-provoking look at how we are encouraged to think about
security, and how much security is actually desirable. For example,
he explains why security systems must not just work well, but
fail well, and why secrecy often undermines security instead
of enhancing it.
|
 | [Schneier 2005] | Bruce Schneier:
Secrets and Lies.
Wiley,
2005,
0471453803.
|
|
Having written the standard book on cryptography, Schneier now
argues that technology alone can't solve most real security
problems. The book covers systems and threats, the technologies used
to protect and intercept data, and strategies for proper
implementation of security systems. Rather than blind faith in
prevention, Schneier advocates swift detection and response to an
attack, while maintaining firewalls and other gateways to keep out
the amateurs.
|
 | [Sedgewick 2001] | Robert Sedgewick:
Algorithms in C, Parts 1-5.
Addison-Wesley Professional,
2001,
0201756080.
|
|
Far too many programmers still think and code as if resizeable
vectors and string-to-pointer hash tables were the only data
structures ever invented. These books are a guide to all the other
conceptual tools that working programmers ought to have at their
fingertips, from sorting and searching algorithms to different kinds
of trees and graphs. The analysis isn't as deep as that in Knuth's
monumental The Art of Programming, but
that makes the book far more accessible. And while the author's use
of C may seem old-fashioned in an age of Java and C#, it does ensure
that nothing magical is hidden inside an overloaded operator or
virtual method call.
|
 | [Skoudis 2004] | Ed Skoudis:
Malware.
Prentice-Hall,
2004,
0131014056.
|
|
This 647-page tome is a survey of harmful software, from viruses
and worms through Trojan horses, root kits, and even malicious
microcode. Each threat is described and analyzed in detail, and
the author gives plenty of examples to show exactly how the attack
works, and how to block (or at least detect) it. The writing is
straightforward, and the case studies in Chapter 10 are funny
without being too cute.
|
 | [Spinellis 2003] | Diomidis Spinellis:
Code Reading.
Addison-Wesley,
2003,
0201799405.
|
|
The book's preface says it best: “The reading of code is
likely to be one of the most common activities of a computing
professional, yet it is seldom taught as a subject or formally used
as a method for learning how to design and program.” Spinellis
isn't the first person to make this point, but he is the first
person I know of to do something about it. In this book, he walks
through hundreds of examples of C, C++, Java, and Perl, drawn from
dozens of Open Source projects such as Apache, NetBSD, and Cocoon.
Each example illustrates a point about how programs are actually
built. How do people represent multi-dimensional tables in C? How
do people avoid nonreentrant code in signal handlers? How do they
create packages in Java? How can you recognize that a data
structure is a graph? A hashtable? That it might contain a race
condition? And on, and on, real-world issue after real-world issue,
each one analyzed and cross-referenced. There's also a section on
additional documentation sources, and a chapter on tools that can
help you make sense of whatever you've just inherited.
|
 | [Steele 1999] | Guy L. Steele Jr.:
"Growing a Language",
Journal of Higher-Order and Symbolic Computation,
vol. 12,
no. 3,
pp. 221-236,
1999.
|
|
The best (and wittiest) discussion ever published of how programming
languages ought to evolve.
|
 | [Spolsky 2004] | Joel Spolsky:
Joel on Software.
APress,
2004,
1590593898.
|
|
Joel on Software collects some of
the witty, insightful articles Spolsky has blogged over the past few
years. His observations on hiring programmers, measuring how well a
development team is doing its job, the API wars, and other topics are always
entertaining and informative. Over the course of forty-five short
chapters, he ranges from the specific to the general and back again,
tossing out pithy observations on the commoditization of the
operating system, why you need to hire more testers, and why NIH
(the not-invented-here syndrome) isn't necessarily a bad thing.
|
 | [Thompson & Chase 2005] | Herbert H. Thompson and Scott G. Chase:
The Software Vulnerability Guide.
Charles River Media,
2005,
1584503580.
|
|
My current favorite guide to computer security for programmers,
this books walks through each major family of security holes in
turn: faulty permission models, bad passwords, macros, dynamic
linking and loading, buffer overflow, format strings and various
injection attacks, temporary files, spoofing, and more.
|
 | [Ullman & Liyanage 2004] | Larry Ullman and Marc Liyanage:
C Programming.
Peachpit Press,
2004,
0321287630.
|
|
A gentle introduction to C, with many examples.
|
 | [Williams & Kessler 2003] | Laurie Williams and Rober Kessler:
Pair Programming Illuminated.
Addison-Wesley,
2003,
0201745763.
|
|
A combination of an instruction manual, a summary of the authors'
empirical studies of pair programming's effectiveness, and
advocacy, this book is the reference guide for anyone who wants
to introduce pair programming into their development team.
|
 | [Wilson 2005] | Greg Wilson:
Data Crunching.
Pragmatic Bookshelf,
2005,
0974514071.
|
|
Every day, all around the world, programmers have to recycle legacy
data, translate from one vendor's proprietary format into another's,
check that configuration files are internally consistent, and search
through web logs to see how many people have downloaded the latest
release of their product. It may not be glamorous, but knowing how
to do it efficiently is essential to being a good programmer. This
book describes the most useful data crunching techniques, explains
when you should use them, and shows how they will make your life
easier.
|
Glossary
A
- absolute path:
-
Yuan Gan
- abstract data types:
- access control:
- access control lists:
- acquire a lock:
- action:
- actors:
- aggregate:
- alias:
- anchor (in regular expression):
- annotated syntax tree:
- assertion:
- asymmetric cipher:
- atomic:
-
Abhishek Ranjan
- attribute (in XML):
- authentication:
- authorization:
- automatic variables (in Make):
B
- basic authentication:
- binary data:
- binary milestone:
- binary mode:
- blacklist:
-
Jeremy Hussell
- boilerplate:
- boundary object:
- branch:
- breakpoint:
-
Abhishek Ranjan
- breakpoint, conditional:
- bug tracker:
- Bugzilla:
C
- call stack:
- callback:
- camel case:
- chain:
- child class:
- chunk:
- cipher:
- ciphertext:
- class:
- class diagram:
-
Simona Mindy
- client:
- code browser:
- code inspection:
- code review:
- cognitive dissonance:
- collision:
- comment spam:
-
Alireza Moayerzadeh
- commit:
- Common Gateway Interface:
- component object model:
- component testing:
- concurrency:
- connection:
- constructor:
- control object:
- cookie:
- Coordinated Universal Time:
- core dump:
-
Nilesh Bansal
- cross product:
-
Amit Chandel
- cross-site scripting:
- CSV:
- cursor:
- CVS:
D
- dashboard:
- data member:
- data modeling:
-
Lei Jiang
- database column:
- database key:
-
Amit Chandel
- database row:
- database table:
-
Amit Chandel
- daylight savings time:
- dead code:
- deadlock:
-
Yuan Gan
- debuggee:
- decryption:
- default target:
- defense in depth:
- defensive programming:
- derive:
- design pattern:
-
Lei Jiang
- development process:
-
Yuan Gan
- dictionary:
- dictionary key:
- directed graph:
- directory tree:
- docstrings:
- document:
- Document Object Model:
- domain model:
-
Lei Jiang
- domain model diagram:
- drive:
- driver:
E
- element (in XML):
- embed:
- encryption:
- entity object:
- environment variable:
-
Nilton Bila
- epoch:
-
Nilton Bila
- error:
- escape sequence:
- event-driven programming:
- exception:
- executable documentation:
- exponent:
- extend:
- Extreme Programming:
-
Tingting Zou
F
- fail:
- filename extension:
- filter:
- finite state machine:
- fixture:
- flag:
- foreign key:
- form:
G
- garbage collect:
-
Nilesh Bansal
- getter:
- group:
- greedy matching:
H
- heap:
-
Abhishek Ranjan
- heisenbug:
- hexadecimal:
- hijack:
- host address:
- HTTP header:
-
Nilesh Bansal
I
- ICONIX process:
- idiom:
- immutable:
- in-place operator:
- inheritance:
- instance:
- instruction pointer:
- instrument:
- Integrated Development Environment:
- integration testing:
- integrity constraint:
- internet protocol:
-
Nilesh Bansal
- invert:
- issue tracker:
J
- join:
-
Amit Chandel
K
L
- leap second:
- leap year:
- Liskov Substitution Principle:
- little-endian:
-
Alireza Moayerzadeh
- local time:
- lock:
- logging:
M
- macro:
- mailing list:
- Make:
- mantissa:
- match object:
- memory model:
- merge:
- method:
-
Nilton Bila
- milestone:
- module:
- multi-valued assignment:
- multiplicity:
- Multipurpose Internet Mail Extensions:
- mutable:
N
- nested query:
O
- object:
-
Lei Jiang
- object-oriented analysis and design (OOAD):
-
Tingting Zou
- operating system:
-
Nikos Sarkas
- operator overloading:
-
Nilton Bila
- optimistic concurrency:
-
Tingting Zou
- outcome, actual:
- outcome, expected:
- override:
P
- pack:
- pair programming:
- parent class:
- parent directory:
- pass:
- path:
- pattern rule:
- pessimistic concurrency:
-
Tingting Zou
- phony target:
- pipe:
-
Abhishek Ranjan
- plaintext:
- polymorphism:
- port:
- post mortem:
- post mortem debugging:
- post-condition:
- pre-condition:
- prerequisite:
- pretty-printer:
- private key:
- process:
- profiling:
- program slice:
- project velocity:
- protocol:
- public key:
- public key cryptography:
- publish-subscribe:
Q
R
- race condition:
- raise exception:
- record:
-
Amit Chandel
- refactor:
- reference count:
- reflection:
- register:
- regression:
- regression testing:
- regular expression:
-
Abhishek Ranjan
- relational database:
-
Yuan Gan
- relative path:
-
Yuan Gan
- release a lock:
- release log:
-
A file (often a spreadsheet) that records exactly what was
shipped to whom, when.
- reluctant matching:
- remote procedure call:
- replay attack:
- repository:
- repository browser:
- risk assessment:
- roadmap:
- robustness diagram:
- role:
- root directory:
-
Nilton Bila
- RSS:
-
Alireza Moayerzadeh
- rule:
S
- sample:
- scaffolding:
- screen scraping:
- search path:
- seek:
- sequence:
- sequence diagram:
- server:
-
Nilesh Bansal
- setter:
- shared library:
- shell:
- short-circuit evaluation:
- Simple API for XML:
-
Alireza Moayerzadeh
- single-step:
- slice:
- social engineering:
- socket:
- SourceForge:
-
Nikos Sarkas
- sparse:
- spawn:
- special method:
- spiral model:
- SQL:
- SQL injection:
- stack frame:
- stack pointer:
- standard error:
- standard input:
- standard output:
- state assertion:
- static space:
- status code:
- step into:
- step over:
- submodule:
- Subversion:
- suspended process:
- SWIG:
- symbolic debugger:
- symmetric cipher:
- system testing:
T
- tag (in XML):
- target:
- target program:
- template:
- test:
- test result:
- test suite:
- test-driven development (TDD):
-
Alireza Moayerzadeh
- testing, bottom-up:
- text:
- three-tier architecture:
- ticket:
- ticket, closed:
- ticket, open:
- time structure:
- time zone:
- timeline:
- traceability:
- transaction:
-
Nikos Sarkas
- Transmission Control Protocol (TCP):
-
Nikos Sarkas
- triage:
- tuple:
- two's complement:
U
- Unicode:
-
Nikos Sarkas (see also ).
- Unified Modeling Language (UML):
-
Lei Jiang
- unit testing:
- unpack:
- URL encode:
- use case:
-
Tingting Zou
- use case diagram:
-
Simona Mindy
- user stories:
-
Simona Mindy
V
- validate:
- verifiable deliverable:
- version control system:
- version number:
-
Large projects typically use multi-part version numbers to
identify software releases. In extreme cases,
“6.2.3.1407” means major version 6, minor version 2,
patch 3, build 1407. The major version number is only incremented
when significant changes are made, where
“significant” means “changes that make this version's
data/configuration/whatever impossible for older versions to
read”. Minor version numbers are what most people think of
as releases. If you've added a few new features, changed part of the
GUI, etc., you increment the minor version number and throw it to
customers.
Patches are things that don't have their own installers. If,
for example, you need to change one HTML form, or one
DLL, you will often just mail that out to customers, along with
instructions about where to put it, rather than creating a new
installer. You should still give it a number, though, and make an
entry in your release log. Finally, the build number is
incremented every time you create a new version of the product for QA
to test. Build numbers are never reset, i.e. you don't go from
5.2.2.1001 to 6.0.0.0, but from 5.2.2.1001 to 6.0.0.1002, and so
on. Build numbers are what developers care about: they're often only
matched up with version numbers after the fact (i.e. you create build
#1017, QA says, “Yeah, it looks good,” so you say,
“All right, this'll be 6.1.0,” and voila, you have
6.1.0.1017.)
Finally, groups will sometimes identify pre-releases as
“beta 1”, “beta 2”, and so on, as in "6.2
beta 2". Again, this label is usually attached to a particular
build after the fact—you wait until QA (or whoever) says
that build #1017 is good enough to send out to customers, then tag
it in version control.
- virtual machine:
- visitor:
W
- watchpoint:
- waterfall model:
-
Simona Mindy
- web services:
- web spider:
-
Simona Mindy
- weblog:
-
Jeremy Hussell
- whitelist:
-
Jeremy Hussell
- wiki:
-
Jeremy Hussell
- wildcard:
-
Jeremy Hussell
- working copy:
X
Y
Z
Online Resources
- Andrew Kuchling's Python Regular Expression HOWTO:
-
A short tutorial on the basics of regular expressions aimed at
Python programmers.
(Viewed 2005-09-01.)
- Apache Web Server:
-
The main site for the most widely-used web server in the world.
(Viewed 2005-07-26.)
- ArgoUML:
-
The most advanced open source UML tool available.
(Viewed 2005-08-08.)
- Bitten:
-
A tool for running builds continuously in the background across multiple machines.
(Viewed 2005-09-01.)
- CruiseControl:
-
A framework for managing continuous builds. Each time you update something in your version control repository, CruiseControl recompiles your code, re-runs your tests, and lets you (and your teammates) know if you've broken anything.
(Viewed 2005-07-26.)
- Cygwin:
-
A Linux-like environment for Windows, which brings with it a lot of other tools (like SSH and GNU Make).
(Viewed 2005-07-26.)
- Daily Python URL:
-
Daily (more or less) news about
Python modules, new releases of the interpreter, meetings, etc.
(Viewed 2005-07-26.)
- Dive Into Python:
-
The complete text of
[Pilgrim 2004]
available on-line.
(Viewed 2005-09-06.)
- Doctor Dobb's Journal Python-URL:
-
Bi-weekly news from the world of
Python.
(Viewed 2005-09-01.)
- Dept. of Computer Science, University of Toronto:
-
The principal author's home department, which helped make this course happen.
(Viewed 2005-07-26.)
- Eclipse:
-
Originally developed by IBM for Java development, Eclipse is the biggest open source
development environment around these days. There are literally hundreds of plugins
for it, and hundreds of thousands of users. It's not for the faint of heart (and it
definitely won't be happy on a four-year-old hand-me-down machine), but it's one of
the real power tools of modern programming.
(Viewed 2005-08-10.)
- Emacs:
-
It doesn't actually stand for “eighteen megabytes and constantly swapping”, and it isn't
actually a Lisp-based operating system, but there's a grain of truth in each claim. Many programmers
swear by it; others massage their aching wrists and swear at it instead, but there's no denying
that it's a very powerful tool.
(Viewed 2005-08-10.)
- Firefox:
-
The best web browser around, where “best” means “nicest interface”, “most
extensible”, and “least insecure”.
(Viewed 2005-07-26.)
- GIMP:
-
The GNU Image Manipulation Program is a powerful tool for creating and manipulating images in a wide variety of formats.
(Viewed 2005-07-26.)
- Internet Groupware for Scientific Collaboration:
-
While it is now several years old, Udell's examination of what the web could be, and how it could help scientists collaborate more effectively, is still as thought-provoking as it was when it first appeared.
(Viewed 2005-07-26.)
- Komodo:
-
A commercial IDE that supports Perl, PHP, Tcl, and XSLT as well as
Python.
(Viewed 2005-07-29.)
- MySQL:
-
The most popular open source database around (though many discerning users prefer
PostgreSQL).
(Viewed 2005-07-29.)
- Numeric:
-
The standard numerical programming module for
Python, with good support for multi-dimensional arrays.
(Viewed 2005-08-16.)
- PEP-008: Python Style Guide:
-
A semi-official guide to
Python coding conventions.
(Viewed 2005-07-26.)
- PostgreSQL:
-
The main site for an advanced open source relational database. It may not be as popular as its main competitor, MySQL, but most people who have used both have found PostgreSQL easier to work with.
(Viewed 2005-07-26.)
- PyChecker:
-
A code checking tool for
Python that complements PyLint.
(Viewed 2005-09-06.)
- PyDev:
-
PyDev is a plugin for
Eclipse, an open source IDE backed by IBM that
was originally targeted at Java, but which now has plugins to support several other
languages. Like Eclipse, PyDev needs a lot of memory to run well.
(Viewed 2005-08-10.)
- PyLint:
-
A code checking tool for
Python that complements PyChecker.
(Viewed 2005-09-06.)
- Python:
-
The main site for all things Python.
(Viewed 2005-07-26.)
- Python Imaging Library:
-
A widely-used image processing library for
Python, capable of working with every popular graphics format around.
(Viewed 2005-07-28.)
- Python Cookbook:
-
An ever-growing collection of
Python tips and tricks.
(Viewed 2005-07-29.)
- Python Software Foundation:
-
A non-profit organization devoted to advancing open source technology related to
Python, and the main financial sponsor of this course.
(Viewed 2005-07-26.)
- RapidSVN:
-
A cross-platform GUI for
Subversion.
(Viewed 2005-09-13.)
- SciPy:
-
A set of extensions to the Numeric module for scientific programming.
(Viewed 2005-08-16.)
- SCons:
-
A powerful
Python-based build management tool.
(Viewed 2005-07-28.)
- SmartSVN:
-
A cross-platform GUI for
Subversion.
(Viewed 2005-07-14.)
- SPE:
-
“Stani's Python Editor” has grown to be more than just
an editor: it now supports several cool tools, including a regular
expression debugger.
(Viewed 2005-08-10.)
- SQLite:
-
A small, simple, and very fast relational database that can be run on its own, or integrated into other applications.
(Viewed 2005-07-26.)
- Subversion:
-
The main site for Subversion is aimed more at Subversion's developers than at its users; if you're looking for a how-to,
[Mason 2005]
is a good place to start.
(Viewed 2005-07-26.)
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!):
-
Joel Spolsky's 15-minute guide to character set encodings, and what you have to do to deal with the fact that most of the world doesn't use the standard American alphabet. This article is reprinted in
[Spolsky 2004]
.
(Viewed 2005-07-26.)
- Top 10 Accessible Web Authoring Practices:
-
Ten simple, easy-to-follow rules that will make your web site more accessible to people with various disabilities.
(Viewed 2005-07-26.)
- TortoiseSVN:
-
A cross-platform GUI for
Subversion.
(Viewed 2005-09-13.)
- University of Toronto:
-
Canada's biggest university, and the host institution for much of this work.
(Viewed 2005-07-26.)
- Vim:
-
“Vi Improved” is a modern version of the classic Unix text editor,
and is supported by an active development community.
(Viewed 2005-08-10.)
- WingIDE:
-
A commercial IDE targeted solely at
Python developers.
(Viewed 2005-07-29.)
- YesLogic:
-
Makers of Prince, the document formatter and generator used to produce the
PDF version of these notes.
(Viewed 2005-10-05.)
List of Figures
List of Tables
Syllabus
Software Carpentry
- course outline
- acknowledgments
Introduction
- motivation for course
- need to improve quality as well as efficiency
- key idea
- target audience
- self-test
- course philosophy
- topics
- what you will need
- recommended reading
- typographic conventions
Version Control
- motivation for this lecture
- version control systems
- CVS and Subversion
- basic operations
- command line and GUI clients
- merging conflicts
- versioning
- binary files
- rollback
- getting started
- Subversion command reference
- reading Subversion output
- branching and merging
Shell Basics
- motivation for this lecture
- difference between shell and operating system
- the file system
- basic commands
- creating files and directories
- wildcards
More Shell
- input, output, and redirection
- pipes
- environment variables
- the PATH variable
- basic tools
- Unix ownership and permission
- Windows ownership and permission
- some more advanced tools
Basic Scripting
- introducing Python
- interactive use
- batch use
- variables
- printing and quoting
- numbers
- Booleans
- comparisons
- conditionals
- while, break, and continue
Strings, Lists, and Files
- strings
- indexing
- string methods
- lists
- list methods
- for and range
- testing for membership
- aliasing
- tuples
- file I/O
Functions, Libraries, and the File System
- defining functions
- variable scope
- aliasing
- default parameter values
- variable-length argument lists
- functions as objects
- creating modules
- math library
- system library
- representing and manipulating times
- file system programming
- The os.path Module
- The __main__ Keyword
- Python sources
Testing Basics
- motivation
- terminology
- rules for tests
- testing frameworks
- choosing tests
Dictionaries and Error Handling
- motivation
- string formatting
- dictionaries
- working with dictionaries
- dictionary methods
- counting and inverting
- dictionary string formatting
- try, except, and else
- exceptions
- exception handler stack
- raising exceptions
- assertions and defensive programming
- executing subprocesses
Debugging
- what's wrong with print statements
- symbolic debuggers
- running in a debugger
- breakpoints, inspection, and single-stepping
- how debuggers work
- advanced operations
- rule 0
Object-Oriented Programming
- motivation
- defining classes
- defining methods
- defining classes
- special methods
- inheritance
- polymorphism
- Liskov substitution principle
- class members
- operator overloading
Structured Unit Testing
- mechanics
- example
- setup and teardown
- testing assertions
- testing I/O
- testing classes
- test-driven development
Automated Builds
- motivation for this lecture
- overview of Make
- basic features
- handling multiple targets
- defining phony targets
- automatic variables
- pattern rules
- specifying dependencies
- defining variables
Coding Style and Reading Code
- why read code
- cognition
- determining functionality
- naming
- idioms
- style-checking tools
- role of documentation
- traceability
- defensive programming and other pseudo-documentation
- active reading
Watching Programs Run
- Python's object model
- dynamic programming
- coverage
- profiling
Regular Expressions
- basic patterns
- anchors
- escape sequences
- extracting matches
- compiled REs
- other languages
- other patterns
Basic XML and XHTML
- formatting rules
- XHTML
- attributes
- images and links
- accessibility
- DOM
- basic features of DOM
- creating a DOM tree
- walking a DOM tree
- modifying a DOM tree
A Mini-Project
- checking for tabs
- automation
- checking for printable characters
- checking glossary entries
- checking cross-references
Binary Data
- motivation
- representing numbers
- bitwise operators
- shifting
- floating point numbers
- binary I/O
- packing data structures
- variable-length data
- metadata
Relational Databases
- create, insert, and drop
- simple queries
- joins
- negation and nested queries
- aggregation
- application programming
- handling NULL
- concurrency
Client-Side Web Programming
- underlying protocols
- HTTP
- HTTP request
- HTTP response
- urllib
- parameterizing requests
- encoding and decoding
- web services
CGI
- overview
- MIME types
- example
- creating forms
- handling form data
- server-side state
- concurrency
- cookies
Security
- goals
- limitations of technical solution
- terminology
- input is untrustworthy
- the evils of exec
- basic cryptography
- basic cryptography
- HTTPS
- ssh
- guidelines
Teamware
- overview of Trac
- repository browser and timeline
- issue tracker
- writing useful tickets
- mailing lists
- wiki
- roadmap and milestones
- dashboard
- weblogs
- administration
Extreme Programming
- Extreme Programming
- XP core practices
- planning game
- pair programming
- project velocity
- counter-arguments
The ICONIX Process
- use cases
- domain model
- robustness diagrams
- class diagrams
- sequence diagrams
- order of operations
- UML tools
The Nevex Process
- vision statement
- analysis & estimation
- A&E format
- design, implementation, and maintenance
- QA and documentation
- post mortem
Backward, Forward, and Sideways
- intro
- the Numeric module
- Python Imaging Library
- SCons
- integrating C into Python
- SWIG, Fortran, and embedding Python in C
- design patterns
- refactoring
- non-technical reading
- rules of programming
|
Copyright © 2005, Python Software Foundation. See License for details.
| |