Software Carpentry

Welcome

Course Outline

Acknowledgments

Introduction

Motivation

Meeting Standards

The Most Important Idea in This Course

Who You Are

A Quick Self-Test

Learn by Building

Topics

Setting Up

Recommended Reading

Typographic Conventions

Version Control

Problem #1: Synchronizing Files

Problem #2: Undoing Changes

Solution: Version Control

CVS and Subversion

Basic Use

How To Do It

Working Together

What Versions Actually Mean

Warning: Binary Files

Rolling Back Changes

And Finally, Getting Started

Subversion Command Reference

How to Read Subversion Output

Branching and Merging

Exercises

Exercise 3.1:

Follow the instructions given to you by your instructor to check out a copy of the Subversion repository you'll be using in this course. Unless otherwise noted, the exercises below assume that you have done this, and that your working copy is in a directory called course. You will submit all of your exercises in this course by checking files into your repository.

Exercise 3.2:

Create a file course/ex01/bio.txt (where course is the root of your working copy of your Subversion repository), and write a short biography of yourself (100 words or so) of the kind used in academic journals, conference proceedings, etc. Commit this file to your repository. Remember to provide a meaningful comment when committing the file!

Exercise 3.3:

What's the difference between mv and svn mv? Put the answer in a file called course/ex01/mv.txt and commit your changes.

Once you have committed your changes, type svn log in your course directory. If you didn't know what you'd just done, would you be able to figure it out from the log messages? If not, why not?

Exercise 3.4:

In this exercise, you'll simulate the actions of two people editing a single file. To do that, you'll need to check out a second copy of your repository. One way to do this is to use a separate computer (e.g., your laptop, your home computer, or a machine in the lab). Another is to make a temporary directory, and check out a second copy of your repository there. Please make sure that the second copy isn't inside the first, or vice versa—Subversion will become very confused.

Let's call the two working copies Blue and Green. Do the following:

a) Create Blue/ex01/planets.txt, and add the following lines:

Mercury
Venus
Earth
Mars
Jupiter
Saturn

Commit the file.

b) Update the Green working copy. (You should get a copy of planets.txt.)

c) Change Blue/ex01/planets.txt so that it reads:

1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn

Commit the changes.

d) Edit Green/ex01/planets.txt so that its contents are as shown below. Do not do svn update before editing this file, as that will spoil the exercise.

Mercury 0
Venus 0
Earth 1
Mars 2
Jupiter 16 (and counting)
Saturn 14 (and counting)

e) Now, in Green, do svn update. Subversion should tell you that there are conflicts in planets.txt. Resolve the conflicts so that the file contains:

1. Mercury 0
2. Venus 0
3. Earth 1
4. Mars 2
5. Jupiter 16
6. Saturn 14

Commit the changes.

f) Update the Blue working copy, and check that planets.txt now has the same content as it has in the Green working copy.

Exercise 3.5:

Add another line or two to course/ex01/bio.txt and commit those changes. Then, use svn merge to restore the original contents of your biography (course/ex01/bio.txt), and commit the result. When you are done, bio.txt should look the way it did when you first committed it in Exercise 3.2. Note: the purpose of this exercise is to teach you how to go back in time to get old versions of files. While it would be simpler in this case just to edit bio.txt, you can't (reliably) do that when you've made larger changes, to multiple files, over a longer period of time.

Shell Basics

Introduction

The Shell vs. the Operating System

The File System

A Few Simple Commands

Creating Files and Directories

Wildcards

Exercises

Exercise 4.1:

Suppose you are in your home directory, and ls shows you this:

Makefile        biography.txt   data
enrolment.txt   programs        thesis

What argument(s) do you have to give to ls to get it to put a trailing slash after the names of subdirectories, like this:

Makefile        biography.txt   data/
enrolment.txt   programs/       thesis/

If you run ls data, it shows:

earth.txt       jupiter.txt     mars.txt
mercury.txt     saturn.txt      venus.txt

What command should you run to get the following output:

data/earth.txt          data/jupiter.txt        data/mars.txt
data/mercury.txt        data/saturn.txt         data/venus.txt

What if you want this (note that an extra entry is being displayed):

total 7
drwxr-xr-x    7 someone        0 May  6 08:27 .svn
-rw-r--r--    1 someone     2396 May  6 08:38 earth.txt
-rw-r--r--    1 someone     1263 May  6 08:38 jupiter.txt
-rw-r--r--    1 someone     1015 May  6 08:43 mars.txt
-rw-r--r--    1 someone      946 May  6 08:41 mercury.txt
-rw-r--r--    1 someone     1714 May  6 08:40 saturn.txt
-rw-r--r--    1 someone      881 May  6 08:40 venus.txt

Note: the command will display your user ID, rather than someone. On some machines, the command will also display a group ID. Ignore these differences for the purpose of this question.

Exercise 4.2:

According to the listing of the data directory above, who can read the file mercury.txt? Who can write it (i.e., change its contents or delete it)? When was mercury.txt last changed? What command would you run to allow everyone to edit or delete the file?

Exercise 4.3:

Suppose you want to remove all files whose names (not including their extensions) are of length 3, start with the letter a, and have .txt as extension. What command would you use? For example, if the directory contains three files a.txt, abc.txt, and abcd.txt, the command should remove abc.txt, but not the other two files.

Exercise 4.4:

What does the command cd ~ do? What about cd ~gvwilson?

Exercise 4.5:

What's the difference between the commands cd HOME and cd $HOME?

Exercise 4.6:

Suppose you want to list the names of all the text files in the data directory that contain the word "carpentry". What command or commands could you use?

Exercise 4.7:

Suppose you have written a program called analyze. What command or commands could you use to display the first ten lines of its output? What would you use to display lines 50-100? To send lines 50-100 to a file called tmp.txt?

Exercise 4.8:

The command ls data > tmp.txt writes a listing of the data directory's contents into tmp.txt. Anything that was in the file before the command was run is overwritten. What command could you use to append the listing to tmp.txt instead?

Exercise 4.9:

What command(s) would you use to find out how many subdirectories there are in the lectures directory?

Exercise 4.10:

What does rm *.ch do? What about rm *.[ch]?

Exercise 4.11:

What command(s) could you use to find out how many instances of a program are running on your computer at once? For example, if you are on Windows, what would you do to find out how many instances of svchost.exe are running? On Unix, what would you do to find out how many instances of bash are running?

Exercise 4.12:

What do the commands pushd, popd, and dirs do? Where do their names come from?

Exercise 4.13:

How would you send the file earth.txt to the default printer? How would you check it made it (other than wandering over to the printer and standing there)?

Exercise 4.14:

A colleague asks for your data files. How would you archive them to send as one file? How could you compress them?

Exercise 4.15:

The instructor wants you to use a hitherto unknown command for manipulating files. How would you get help on this command?

Exercise 4.16:

You have changed a text file on your home PC, and mailed it to the university terminal. What steps can you take to see what changes you may have made, compared with a master copy in your home directory?

Exercise 4.17:

How would you change your password?

Exercise 4.18:

grep is one of the more useful tools in the toolbox. It finds lines in files that match a pattern and prints them out. For example, assume I have files earth.txt and venus.txt containing lines like this:

Name: Earth
Period: 365.26 days
Inclination: 0.00
Eccentricity: 0.02

If I type grep Period *.txt in that directory, I get:

earth.txt:Period: 365.26 days
venus.txt:Period: 224.70 days

Search strings can use regular expressions, which will be discussed in a later lecture. grep takes many options as well; for example, grep -c /bin/bash /etc/passwd reports how many lines in /etc/passwd (the Unix password file) contain the string /bin/bash, which in turn tells me how many users are using bash as their shell.
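The behavior described above can be imitated in a few lines of Python. This is only a rough sketch, not how grep itself is implemented: it does plain substring matching instead of regular expressions, the function names grep and grep_count are made up for illustration, and the files are in-memory stand-ins (the Name line for Venus is assumed; only its Period line appears above).

```python
def grep(pattern, files):
    """Return 'filename:line' for every line that contains pattern.

    Plain substring matching only; real grep uses regular expressions.
    """
    matches = []
    for name in sorted(files):
        for line in files[name]:
            if pattern in line:
                matches.append(name + ":" + line)
    return matches


def grep_count(pattern, lines):
    """Count the lines that contain pattern, roughly like grep -c."""
    return sum(1 for line in lines if pattern in line)


# Hypothetical in-memory stand-ins for earth.txt and venus.txt.
planet_files = {
    "earth.txt": ["Name: Earth", "Period: 365.26 days",
                  "Inclination: 0.00", "Eccentricity: 0.02"],
    "venus.txt": ["Name: Venus", "Period: 224.70 days"],
}

for match in grep("Period", planet_files):
    print(match)
```

Like grep -c, grep_count counts matching lines, not individual occurrences of the string.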

Suppose all you wanted was a list of the files that contained lines matching a pattern, rather than the matches themselves—what flag or flags would you give to grep? What if you wanted the line numbers of matching lines?

Exercise 4.19:

diff finds and displays the differences between two files. It works best if both files are plain text (i.e., not images or Excel spreadsheets). By default, it shows the differences in groups, like this:

3c3,4
< Inclination: 0.00
---
> Inclination: 0.00 degrees
> Satellites: 1

(The rather cryptic header "3c3,4" means that line 3 of the first file must be changed to get lines 3-4 of the second.)
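To see where a header like "3c3,4" comes from, here is a small sketch that uses Python's difflib module to find changed regions and format them the same way. It is an illustration under assumptions, not a reimplementation of diff: the helper name change_headers is hypothetical, only "c" (change) regions are reported, and the unchanged lines around the change are assumed, since the listing above shows only the lines that differ.

```python
import difflib


def change_headers(old_lines, new_lines):
    """Return diff-style headers such as '3c3,4' for each changed region."""

    def span(start, end):
        # start and end are 0-based and half-open; diff headers use
        # 1-based, inclusive line numbers.
        if end - start == 1:
            return str(start + 1)
        return "%d,%d" % (start + 1, end)

    headers = []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # pure insertions and deletions are skipped here
            headers.append(span(i1, i2) + "c" + span(j1, j2))
    return headers


# Assumed file contents; only the changed lines come from the listing above.
old = ["Name: Earth", "Period: 365.26 days",
       "Inclination: 0.00", "Eccentricity: 0.02"]
new = ["Name: Earth", "Period: 365.26 days",
       "Inclination: 0.00 degrees", "Satellites: 1", "Eccentricity: 0.02"]

print(change_headers(old, new))
```

The numbers before the "c" refer to lines in the first file, and the numbers after it to lines in the second.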

What flag(s) should you give diff to tell it to ignore changes that just insert or delete blank lines? What if you want to ignore changes in case (i.e., treat lowercase and uppercase letters as the same)?

Exercise 4.20:

Suppose you wanted ls to sort its output by filename extension, i.e., to list all .cmd files before all .exe files, and all .exe's before all .txt files. What command or commands would you use?

More Shell

Redirecting Input and Output

Pipes

Environment Variables

How the Shell Finds Programs

Basic Tools

Ownership and Permission: Unix

Ownership and Permission: Windows

More Advanced Tools

Exercises

Exercise 5.1:

You're worried your data files can be read by your nemesis, Dr. Evil. How would you check whether or not he can, and if necessary change permissions so only you can read or write the files?

Basic Scripting

Why Python?

Running Python Interactively

Running Saved Programs

Variables

Printing and Quoting

Numbers and Arithmetic

Booleans

Comparisons

Conditionals

While Loops, Break, and Continue

Strings, Lists, and Files

Where We Just Were

But First, Strings

Slicing, Bounds, and Negative Indices

String Methods

Lists

List Methods

For Loops and Ranges

Membership

Nesting Lists

Tuples

Files

Other Ways to Do It

Exercises

Exercise 7.1:

What does "aaaaa".count("aaa") return? Why?

Exercise 7.2:

What does the built-in function enumerate do? Use it to write a function called findOver that takes a list of numbers called values, and a number called threshold, as arguments, and returns a list of the locations where items in values are greater than threshold. For example, findOver([1.1, 3.8, -1.6, 7.4], 2.0) should return [1, 3], since the values in the input list at locations 1 and 3 are greater than the threshold 2.0.

Exercise 7.3:

What does each of the following five code fragments do? Why?

x = ['a', 'b', 'c', 'd']
x[0:2] = []

x = ['a', 'b', 'c', 'd']
x[0:2] = ['q']

x = ['a', 'b', 'c', 'd']
x[0:2] = 'q'

x = ['a', 'b', 'c', 'd']
x[0:2] = 99

x = ['a', 'b', 'c', 'd']
x[0:2] = [99]

Exercise 7.4:

What does 'a'.join(['b', 'c', 'd']) return? If you have a list of strings, how can you concatenate them in a single statement? Why do you think join is written this way, rather than as ['b', 'c', 'd'].join('a')?

Functions, Libraries, and the File System

Where We Just Were

Defining Functions

Scope

Parameter Passing Rules

Default Parameter Values

Extra Arguments

Functions Are Objects

Creating Modules

The Math Library

The System Library

Times

Working with the File System

Manipulating Pathnames

Knowing Where You Are

Where to Learn More

Exercises

Exercise 8.1:

Write a function that takes two strings called text and fragment as arguments, and returns the number of times fragment appears in the second half of text. Your function must not create a copy of the second half of text. (Hint: read the documentation for string.count.)

Exercise 8.2:

What does the Python keyword global do? What are some reasons not to write code that uses it?

Exercise 8.3:

Consider the following sample of code and its output:

def settings(first, **rest):
    print 'first is', first
    print 'rest is'
    for (name, value) in rest.items():
        print '...', name, value
    print

settings(1)
settings(1, two=2, three="THREE")

first is 1
rest is

first is 1
rest is
... two 2
... three THREE

What does the variable rest do? What does the double asterisk ** in front of its name mean? How does it compare to the example with *extra (with a single asterisk) in the lecture?

Exercise 8.4:

Python allows you to import all the functions and variables in a module at once, making them local names. For example, if the module is called values, and contains a variable called Threshold and a function called limit, then after the statement from values import *, you can refer directly to Threshold and limit, rather than having to use values.Threshold or values.limit. Explain why this is generally considered a bad thing to do, even though it reduces the amount programmers have to type.

Exercise 8.5:

sys.stdin, sys.stdout, and sys.stderr are variables, which means that you can assign to them. For example, if you want to change where print sends its output, you can do this:

import sys

print 'this goes to stdout'
temp = sys.stdout
sys.stdout = open('temporary.txt', 'w')
print 'this goes to temporary.txt'
sys.stdout = temp

Do you think this is a good programming practice? When and why do you think its use might be justified?

Exercise 8.6:

os.stat(path) returns an object whose members describe various properties of the file or directory identified by path. Using this, write a function that will determine whether or not a file is more than one year old.
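As a starting point, the sketch below shows two of the members of the object that os.stat returns. It creates its own throwaway file so that it does not depend on any existing path, and the file's contents are arbitrary. Deciding whether a file is more than a year old is left as the exercise.

```python
import os
import tempfile
import time

# Create a small temporary file to inspect.
handle, path = tempfile.mkstemp(suffix=".txt")
os.write(handle, b"Mercury\nVenus\n")
os.close(handle)

info = os.stat(path)
size_in_bytes = info.st_size          # file size in bytes
modified = time.ctime(info.st_mtime)  # st_mtime is seconds since the epoch

print("%d bytes, last modified %s" % (size_in_bytes, modified))

os.remove(path)  # clean up the temporary file
```

Other members, such as st_mode, are described in the documentation for the os module.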

Exercise 8.7:

Write a Python program that takes as its arguments two years (such as 1997 and 2007), and prints out the number of days between the 15th of each month from January of the first year until December of the last year.

Exercise 8.8:

Write a simple version of which in Python. Your program should check each directory on the caller's path (in order) to find an executable program that has the name given to it on the command line.

Testing Basics

Motivation

Terminology

Example: Rectangle Overlap

General Rules for Unit Tests

A Simple Testing Framework

Choosing Test Cases

Dictionaries and Error Handling

Motivation

String Formatting

Dictionaries

The Mechanics

Dictionary Methods

Counting Frequency

Formatting Strings with Dictionaries

Catching Errors

Exception Objects

Functions and Exceptions

Raising Exceptions

Assertions

Running Other Programs

Exercises

Exercise 10.1:

Suppose you wanted to sort entries with the same frequency alphabetically. What changes would you have to make to compareByFrequency?

Debugging

What It Is

What's Wrong with Print Statements

Symbolic Debuggers

Running in a Debugger

Basic Operations

How Debuggers Work

Advanced Operations

Rule 0: Get It Right the First Time

Rule 1: What Is It Supposed to Do?

Rule 2: Is It Plugged In?

Rule 3: Make It Fail

Rule 4: Divide and Conquer

Rule 5: Change One Thing at a Time, For a Reason

Rule 6: Write It Down

Rule 7: Be Humble

Summary

Object-Oriented Programming

Motivation

A Naked Class

Methods

Defining a Queue

Special Methods

Inheritance

Polymorphism

The Substitution Principle

Class Members

Overloading Operators

Structured Unit Testing

A Unit Testing Framework

Mechanics

Testing a Function

Eliminating Redundancy

Testing for Failure

Testing I/O

Testing With Classes

Test-Driven Development

Exercises

Exercise 13.1:

Python has another unit testing module called doctest. It searches files for sections of text that look like interactive Python sessions, then re-executes those sections and checks the results. A typical use is shown below.

def ave(values):
    '''Calculate an average value, or 0.0 if 'values' is empty.
    >>> ave([])
    0.0
    >>> ave([3])
    3.0
    >>> ave([15, -1.0])
    7.0
    '''

    sum = 0.0
    for v in values:
        sum += v
    return sum / float(max(1, len(values)))

if __name__ == '__main__':
    import doctest
    doctest.testmod()

Convert a handful of the tests you have written for other questions in this lecture to use doctest. Do you prefer it to unittest? Why or why not? Do you think doctest makes it easier to test small problems? Large ones? Would it be possible to write something similar for C, Java, Fortran, or Mathematica?

Automated Builds

How Do You Rebuild A Program?

Automate, Automate, Automate

Our Example

Hello, Make

Multiple Targets

Phony Targets

Automatic Variables

Pattern Rules

Dependencies

Defining Macros

Analysis

Exercises

Exercise 14.1:

How can you stop Make from removing intermediate files automatically when it finishes processing?

Exercise 14.2:

Make gets definitions from environment variables, command-line parameters, and explicit definitions in Makefiles. What order does it check these in?

Coding Style and Reading Code

Introduction

Why Read Code?

Seven Plus or Minus

What Does This Function Do?

Naming

Idioms

Style Tools

What About Documentation?

Traceability

Executable Documentation

Active Reading

Summary

Watching Programs Run

Turing's Great Insight

Faking Objects

How Other Languages Do It

Runtime Tricks

Coverage

Profiling

Summary

Exercises

Exercise 16.1:

What percentage of your code is tests? Is tested?

Exercise 16.2:

Can you honestly say that you write tests before code? Find out how many tests currently pass or fail with a single command? Identify the tests associated with a bug? Tell if your code meets the team's standards?

Exercise 16.3:

Can you find out which functions use the most CPU time? How long threads spend blocked on I/O? Who allocates memory, where, for what? How accurate these numbers are? How today's profile differs from last month's? How the profile differs across machines?

Regular Expressions

Introduction

A Simple Example

Anchoring

Escape Sequences

Extracting Matches

Compiling

Using REs in Other Languages

But Wait, There's More

Exercises

Exercise 17.1:

By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE ⌈X(.*)X(.*)⌋ to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.

It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".
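The greedy behavior described at the start of this exercise can be checked directly with Python's re module. This short sketch only demonstrates the default (greedy) match; the variable names are arbitrary, and the reluctant version is left for you to find.

```python
import re

# Greedy matching: the first group grabs as much as it can.
match = re.match(r"X(.*)X(.*)", "XaX and XbX")
first, second = match.groups()

print(repr(first))   # 'aX and Xb'
print(repr(second))  # ''
```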

Basic XML and XHTML

Overview

History

Formatting Rules

XHTML

Attributes

More XHTML Tags

Connecting to Other Data

Accessibility

The Document Object Model

The Basics

Creating a Tree

Walking a Tree

Modifying the Tree

Summary

A Mini-Project

Eating Your Own Cooking

Checking for Tabs

Running Tools

Checking for Printable Characters

Checking Glossary Entries

Checking Cross-References

Summary

Exercises

Exercise 19.1:

What does getopt do when it encounters an argument it doesn't recognize? Write a short program that demonstrates this behavior, that can be run on its own without the user passing in any command-line arguments.

Binary Data

Isn't It All 'Binary'?

How Numbers Are Stored

Bitwise Operators

Shifting

Floating Point

Binary I/O

Packing and Unpacking

The struct Module

Packing Variable-Length Data

Metadata

Relational Databases

It's All Tables

Getting Started

Basic SQL

Simple Queries

Joins

Negation and Nested Queries

Aggregation

Using Other Languages

Handling Null Values

Concurrency

Client-Side Web Programming

Small Pieces, Loosely Joined

Distributed Is Different

Under the Hood

The Hypertext Transfer Protocol

HTTP Request

HTTP Response

Example

Fetching Pages

Passing Parameters

Special Characters

Web Services

Summary

CGI

The Active Web

The Common Gateway Interface

MIME Types

Hello, CGI

Creating Forms

Handling Forms

Development Tips

Maintaining State

What About Concurrency?

Who Are You Again?

Summary

Security

Evil Exists

What Are We Trying to Do?

Technology Alone Cannot Solve the Problem!

How to Think About Security

Rule 1: Don't Trust Your Input

Paths and I/O

Rule 2: Never Run User Commands

Cryptography 101

How Asymmetric Cryptography Works

Securing HTTP

How Do You Log In Remotely?

Do and Don't

Summary

Teamware

Introduction

Overview

Repository Browser and Timeline

Issue Tracker

How to Write Tickets

Mailing Lists

Wiki

Roadmap and Milestones

Dashboard

Blogging

Administration

Summary

Exercises

Exercise 25.1:

Can you find out what bugs are currently being worked on? What feature requests have been deferred? Which files were changed to fix a problem? What fixes are currently being tested? How long it took to fix/implement something?

Exercise 25.2:

What is the status of the overnight build? The overnight regression tests? The issue database? The team's discussions?

Extreme Programming

Code Early, Code Often

Waterfalls and Spirals

Enter the Extremists

Core Practices

The Planning Game

Pair Programming

Project Velocity

The Down Side

Summary

The ICONIX Process

Do It Once, Do It Right

The Unified Modeling Language

From Here to There

Use Cases

Domain Model

Robustness Diagrams

Class Diagrams

Sequence Diagrams

Actual Order

A Note on Tools

Summary

The Nevex Process

A Happy Medium

All the Vision You'll Ever Need

Analysis & Estimation

What Goes Into an A&E

Getting It Built

Other Activities

After the Party's Over

Exercises

Exercise 28.1:

Does your manager know when you expect to complete your current task? How inaccurate the schedule currently is?

Exercise 28.2:

Can you find out when your manager expects you to complete your current task (without asking her directly)? When team members expect to complete their current tasks (without asking them directly)? Who would be affected if you slipped a week?

Backward, Forward, and Sideways

And Now We May Begin

Numerical Programming

Imaging

A Better Way to Build

Integrating with C

Integrating the Other Way

Design Patterns

Refactoring

And a Little Light Reading

The Rules

Conclusion

Bibliography

[Agans 2002]David J. Agans: Debugging. American Management Association, 2002, 0814471684.
Its first sentence says, “This book tells you how to find out what's wrong with stuff, quick,” and that's exactly what it does. In fifteen (very) short chapters, the author presents nine simple rules to help you track down and fix problems in software, hardware, or anything else. His war stories are entertaining (although I think one or two are urban myths), and his advice is eminently practical.
[Brand 1995]Stewart Brand: How Buildings Learn. Penguin USA, 1995, 0140139966.
This beautiful, thought-provoking book starts with the observation that most architects spend their time re-working or extending existing buildings, rather than creating new ones from scratch. Of course, if Brand had written “program” instead of “building”, and “programmer” where he'd written “architect”, everything he said would have been true of computing as well. A lot of software engineering books try to convey the same message about allowing for change, but few do it so successfully. By presenting examples ranging from the MIT Media Lab to a one-room extension to a house, Brand encourages us to see patterns in the way buildings change (or, to adopt Brand's metaphor, the way buildings learn from their environment and from use). Concurrently, he uses those insights to argue that since buildings are always going to be modified, they should be designed to accommodate unanticipated change.
[Castro 2002]Elizabeth Castro: HTML for the World Wide Web. Peachpit Press, 2002, 0321130073.
A clean, clear, comprehensive guide to creating HTML for the web, with good coverage of Cascading Style Sheets (CSS).
[Castro 2000]Elizabeth Castro: XML for the World Wide Web. Peachpit Press, 2000, 0201710986.
Like other books in Peachpit's Visual Quickstart series, this one is beautifully designed, and easy to read without ever being condescending. Its 16 chapters and 4 appendices are organized into 1- and 2-page explanations of particular topics, from writing non-empty elements to namespaces, schemas, and XML transformation. Throughout, Castro strikes a perfect balance between “what”, “why”, and “how”, and provides a surprising amount of detail without ever overwhelming the reader.
[Chase & Simon 1973]W.G. Chase and H.A. Simon: "Perception in chess", Cognitive Psychology, vol. 4, pp. 55-81, 1973.
The original paper comparing the performance of novice and master chess players when confronted with actual and random positions.
[Clark 2004]Mike Clark: Pragmatic Project Automation. Pragmatic Bookshelf, 2004, 0974514039.
This entry in the Pragmatic Bookshelf series focuses on getting your project to build itself, and (more importantly) tell you how the build went, automatically. Clark doesn't confine himself to running Make at 3:00 a.m.; he also covers ways of automatically re-running tests, building and testing installers, monitoring applications, and more.
[DeMarco & Lister 1999]Tom DeMarco and Timothy Lister: Peopleware. Dorset House, 1999, 0932633439.
This was the first book I ever read that said that the leading cause of software project failure was people, rather than technology. Using anecdotes, humor, and common sense, DeMarco and Lister explain how important good physical space, aligning authority with responsibility, and clear direction are.
[Fehily 2003]Chris Fehily: SQL. Peachpit Press, 2003, 0321118030.
This very readable book describes the 5% of SQL that covers 95% of real-world needs. While the book moves a little slowly in some places, the examples are exceptionally clear.
[Feldman 1979]Stuart I. Feldman: "Make---A Program for Maintaining Computer Programs", Software: Practice and Experience, vol. 9, no. 4, pp. 255-265, 1979.
The original description of Make. Last time I checked, Stu Feldman was a vice president at IBM, which shows you just how far a good tool can take you…
[Feathers 2005]Michael C. Feathers: Working Effectively with Legacy Code. Prentice-Hall PTR, 2005, 0131177052.
Most programmers spend most of their time fixing bugs, porting to new platforms, adding new features: in short, changing existing code. If that code is exercised by unit tests, then changes can be made quickly and safely; if it isn't, they can't, so your first job when you inherit legacy code should be to write some. That's where this book comes in. Want to know three different ways to inject a test into a C++ class without changing the code? They're here. Want to know which classes or methods to focus testing on? Read his discussion of pinch points. Need to break inter-class dependencies in Java so that you can test one module without having to configure the entire application? That's in here too, along with dozens of other useful bits of information. Everything is illustrated with small examples, all of them clearly explained and to the point. There are lots of simple diagrams, and a short glossary; all that's missing is hype.
[Fowler 1999]Martin Fowler: Refactoring. Addison-Wesley Professional, 1999, 0201485672.
Like architects, most programmers spend most of their time renovating, rather than creating something completely new on a blank sheet of paper. This book presents and analyzes patterns that come up again and again when programs are being reorganized. Some of these are well-known, such as placing common code in a utility method. Others, such as replacing temporary objects with queries, or replacing constructors with factory methods, are subtler, but no less important. Each entry includes a section on motivation, the mechanics of actually carrying out the transformation, and an example in Java.
[Friedl 2002]Jeffrey E. F. Friedl: Mastering Regular Expressions. O'Reilly, 2002, 0596002890.
The definitive programmer's guide to regular expressions.
[Gamma et al 1995]Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides: Design Patterns. Addison-Wesley, 1995, 0201633612.
The book that started the software design patterns movement. Much of the discussion has been superseded by more recent books, and the use of C++ and Smalltalk for examples feels a little dated, but it is still a landmark in programming.
[Glass 2002]Robert L. Glass: Facts and Fallacies of Software Engineering. Addison-Wesley Professional, 2002, 0321117425.
I really wish someone had given me something like this book when I took my first programming job. If nothing else, it would have been a better way to start thinking about the profession I had stumbled into than the “everybody knows” factoids that I soaked up at coffee time. Some of what he says is well-known: good programmers are up to N times better than bad ones (his value for N is 28), reusable components are three times harder to build than non-reusable ones, and so on. Other facts aren't part of the zeitgeist, though they should be. For example, most of us know that maintenance consumes 40-80% of software costs, but did you know that roughly 60% of that is enhancements, rather than bug fixes? Or that if more than 20-25% of a component has to be modified, it is more efficient to re-write it from scratch? Best of all, Glass backs up every statement he makes with copious references to the primary literature; if you still disagree with him, you'd better be sure you have as much evidence for your point of view as he has for his.
[Goerzen 2004]John Goerzen: Foundations of Python Network Programming. APress, 2004, 1590593715.
This book looks at how to handle several common protocols, including HTTP, SMTP, and FTP. Goerzen doesn't delve deeply into their internals, but instead focuses on how to build clients that use them. His approach is to build solutions to complex problems one step at a time, explaining each addition or modification along the way. He occasionally assumes more background knowledge than most readers of this book are likely to have, but only occasionally, and makes up for it by providing both clear code, and clear explanations of why this particular function has to do things in a particular order, or why that one really ought to be multithreaded.
[Good 2005]Nathan A. Good: Regular Expression Recipes. APress, 2005, 159059441X.
A great how-to for regular expressions, with examples in many different languages.
[Gunderloy 2004]Mike Gunderloy: Coder to Developer. Sybex, 2004, 078214327X.
This practical, readable book is subtitled “Tools and Strategies for Delivering Your Software”, and that's exactly what it's about. Project planning, source code control, unit testing, logging, and build management are all there. Importantly, so are newer topics, like building plugins for your IDE, code generation, and things you can do to protect your intellectual property. Everything is clearly explained, and illustrated with well-chosen examples. While the focus is definitely on .NET, Gunderloy covers a wide range of other technologies, both proprietary and open source. I'm already using two new tools based on references from this book, and plan to make the chapter on “Working with Small Teams” required reading for my students.
[Harold 2004]Elliotte Rusty Harold: Effective XML. Addison-Wesley, 2004, 0321150406.
This book explains which of XML's many features should be used when: Item 12 tells you to store metadata in attributes, and then spends six pages explaining why, while Item 24 analyzes the strengths and weaknesses of various schema languages, and Item 38 covers character set encodings. It's more than most developers will ever want to know, but when you need it, you really need it.
[Hock 2004]Roger R. Hock: Forty Studies that Changed Psychology. Prentice Hall, 2004, 0131147293.
In forty short chapters, Hock describes the turning points in our understanding of how our minds work. The book isn't just about psychology; you'll also learn a lot about how science gets done, and about the scientists who do it.
[Hunt & Thomas 1999]Andrew Hunt and David Thomas: The Pragmatic Programmer. Addison-Wesley, 1999, 020161622X.
This book is about those things that make up the difference between typing in code that compiles, and writing software that reliably does what it's supposed to. Topics range from gathering requirements through design, to the mechanics of coding, testing, and delivering a finished product. The second section, for example, covers “The Evils of Duplication”, “Orthogonality”, “Reversibility”, “Tracer Bullets”, “Prototypes and Post-It Notes”, and “Domain Languages”, and illuminates each with plenty of examples and short exercises.
[Johnson 2000]Jeff Johnson: GUI Bloopers. Morgan Kaufmann, 2000, 1558605827.
Most books on GUI design are long on well-meaning aesthetic principles, but short on examples of what it means to put those principles into practice. In contrast, GUI Bloopers presents case study after case study: What's wrong with this dialog? What should its creators have done instead? And, most importantly, why? The net effect is to teach all of the same principles that other books try to, but in a grounded, understandable way.
[Kernighan & Pike 1984]Brian W. Kernighan and Rob Pike: The Unix Programming Environment. Prentice Hall, 1984, 013937681X.
I have long believed that this book is the real secret to Unix's success. It doesn't just show readers how to use Unix; it explains why the operating system is built that way, and how its "lots of little tools" philosophy keeps simple tasks simple, while making hard ones doable.
[Kernighan & Ritchie 1998]Brian W. Kernighan and Dennis Ritchie: The C Programming Language. Prentice Hall PTR, 1998, 0131103628.
The classic description of the one programming language every serious programmer absolutely, positively has to learn.
[Knuth 1998]Donald E. Knuth: The Art of Computer Programming. Addison-Wesley, 1998, 0201485419.
The lifework of the man who invented many of the basic concepts of algorithm analysis, these massive tomes are like Everest: awe-inspiring, but not for the weak of heart. Most readers will find [Sedgewick 2001] much more approachable.
[Langtangen 2004]Hans P. Langtangen: Python Scripting for Computational Science. Springer-Verlag, 2004, 3540435085.
The book's aim is to show scientists and engineers with little formal training in programming how Python can make their lives better. Regular expressions, numerical arrays, persistence, the basics of GUI and web programming, interfacing to C, C++, and Fortran: it's all here, along with hundreds of short example programs. Some readers may be intimidated by the book's weight, and the dense page layout, but what really made me blink was that I didn't find a single typo or error. It's a great achievement, and a great resource for anyone doing scientific programming.
[Lutz & Ascher 2003]Mark Lutz and David Ascher: Learning Python. O'Reilly, 2003, 0596002815.
This is not only the best introduction to Python on the market, it is one of the best introductions to any programming language that I have ever read. Lutz and Ascher cover the entire core of the language, and enough of its advanced features and libraries to give readers a feeling for just how powerful Python is. In keeping with the spirit of the language itself, their writing is clear, their explanations lucid, and their examples well chosen.
[Margolis & Fisher 2002]Jane Margolis and Allan Fisher: Unlocking the Clubhouse. MIT Press, 2002, 0262133989.
This book describes a project at Carnegie Mellon University that tried to figure out why so few women become programmers, and what can be done to correct the imbalance. Its first six chapters describe the many small ways in which we are all, male and female, conditioned to believe that computers are "boy's things". Sometimes it's as simple as putting the computer in the boy's room, because "he's the one who uses it most". Later on, the "who needs a social life?" atmosphere of undergraduate computer labs drives many women away (and many men, too). The last two chapters describe what the authors have done to remedy the situation at high schools and university. This work proves that by being conscious of the many things that turn women off computing, and by viewing computer science from different angles, we can attract a broader cross-section of society, which can only make our discipline a better place to be. The results are impressive: female undergraduate enrolment at CMU rose by more than a factor of four during their work, while the proportion of women dropping out decreased significantly.
[Martelli 2005]Alex Martelli, Anna Ravenscroft, and David Ascher: Python Cookbook. O'Reilly, 2005, 0596007973.
A useful reference for every serious Python programmer, this book is a collection of tips and tricks, some very simple, others so complex that they require careful line-by-line reading. The book's companion web site is updated regularly.
[Mason 2005]Mike Mason: Pragmatic Version Control Using Subversion. Pragmatic Bookshelf, 2005, 0974514063.
Yet another book from the folks at Pragmatic, this one is everything you'll ever need to know about Subversion, which is on its way to becoming the version control system of choice for open source development.
[McConnell 2004]Steve McConnell: Code Complete. Microsoft Press, 2004, 0735619670.
This classic is a handbook of do's and don'ts for working programmers. It covers everything from how to avoid common mistakes in C to how to set up a testing framework, how to organize multi-platform builds, and how to coordinate the members of a team. In short, it is everything I wished someone had told me before I started my first full-time programming job.
[McConnell 1996]Steve McConnell: Rapid Development. Microsoft Press, 1996, 1556159005.
This book describes what it takes to develop robust code quickly, what mistakes are often made in the name of rapid development, and how to identify and analyze potential risks. It includes a list of 25 best practices, and discusses things that most other books leave out (like recovering from disasters and dealing with impossible demands). Unlike most “how to do it better” books, it doesn't try to sell any particular practice or style, which adds even more weight to McConnell's carefully balanced opinions.
[McConnell 1997]Steve McConnell: Software Project Survival Guide. Microsoft Press, 1997, 1572316217.
A condensed manager-level version of the same author's Rapid Development.
[Pilgrim 2004]Mark Pilgrim: Dive Into Python. APress, 2004, 1590593561.
A good introduction to Python, which is also available on-line at Dive Into Python.
[Powazek 2001]Derek M. Powazek: Design for Community. New Riders, 2001, 0735710759.
This book isn't about web logging, streaming video, or managing mailing lists. Instead, it's about how to structure web sites so that they will foster on-line communities. The writing is personal without being sappy or overbearing, and the author draws upon a wealth of personal experience to explain why you sometimes don't want to make it easy for people to post comments, or how best to deal with abusive posters. There's a lot of no-nonsense analysis of the cost of interactivity, and interviews with the creators of some of the web's most successful community sites.
[Ray & Ray 2003]Deborah S. Ray and Eric J. Ray: Unix. Peachpit Press, 2003, 0321170105.
A gentle introduction to Unix, with many examples.
[Rosenberg & Stephens 2005]Doug Rosenberg and Matt Stephens: Use Case Driven Object Modeling with UML. Addison-Wesley, 2005, 0321278275.
An update of Rosenberg's 1999 book of the same name; in just eight chapters, the authors present a slimmed-down core of UML organized around a four-stage design process. Each stage has clearly defined steps, and concrete milestones which specify what ought to be produced (i.e., how to tell when you're finished).
[Scanlan 1989]David A. Scanlan: "Structured Flowcharts Outperform Pseudocode: An Experimental Comparison", IEEE Software, vol. 6, no. 5, pp. 28-36, 1989.
Describes an experimental comparison of pseudocode with equivalently-structured flowcharts, in which flowcharts did much better than in earlier studies.
[Schneier 2003]Bruce Schneier: Beyond Fear. Springer, 2003, 0387026207.
A thought-provoking look at how we are encouraged to think about security, and how much security is actually desirable. For example, he explains why security systems must not just work well, but fail well, and why secrecy often undermines security instead of enhancing it.
[Schneier 2005]Bruce Schneier: Secrets and Lies. Wiley, 2005, 0471453803.
Having written the standard book on cryptography, Schneier now argues that technology alone can't solve most real security problems. The book covers systems and threats, the technologies used to protect and intercept data, and strategies for proper implementation of security systems. Rather than blind faith in prevention, Schneier advocates swift detection and response to an attack, while maintaining firewalls and other gateways to keep out the amateurs.
[Sedgewick 2001]Robert Sedgewick: Algorithms in C, Parts 1-5. Addison-Wesley Professional, 2001, 0201756080.
Far too many programmers still think and code as if resizeable vectors and string-to-pointer hash tables were the only data structures ever invented. These books are a guide to all the other conceptual tools that working programmers ought to have at their fingertips, from sorting and searching algorithms to different kinds of trees and graphs. The analysis isn't as deep as that in Knuth's monumental The Art of Computer Programming, but that makes the book far more accessible. And while the author's use of C may seem old-fashioned in an age of Java and C#, it does ensure that nothing magical is hidden inside an overloaded operator or virtual method call.
[Skoudis 2004]Ed Skoudis: Malware. Prentice-Hall, 2004, 0131014056.
This 647-page tome is a survey of harmful software, from viruses and worms through Trojan horses, root kits, and even malicious microcode. Each threat is described and analyzed in detail, and the author gives plenty of examples to show exactly how the attack works, and how to block (or at least detect) it. The writing is straightforward, and the case studies in Chapter 10 are funny without being too cute.
[Spinellis 2003]Diomidis Spinellis: Code Reading. Addison-Wesley, 2003, 0201799405.
The book's preface says it best: “The reading of code is likely to be one of the most common activities of a computing professional, yet it is seldom taught as a subject or formally used as a method for learning how to design and program.” Spinellis isn't the first person to make this point, but he is the first person I know of to do something about it. In this book, he walks through hundreds of examples of C, C++, Java, and Perl, drawn from dozens of Open Source projects such as Apache, NetBSD, and Cocoon. Each example illustrates a point about how programs are actually built. How do people represent multi-dimensional tables in C? How do people avoid nonreentrant code in signal handlers? How do they create packages in Java? How can you recognize that a data structure is a graph? A hashtable? That it might contain a race condition? And on, and on, real-world issue after real-world issue, each one analyzed and cross-referenced. There's also a section on additional documentation sources, and a chapter on tools that can help you make sense of whatever you've just inherited.
[Steele 1999]Guy L. Steele Jr.: "Growing a Language", Journal of Higher-Order and Symbolic Computation, vol. 12, no. 3, pp. 221-236, 1999.
The best (and wittiest) discussion ever published of how programming languages ought to evolve.
[Spolsky 2004]Joel Spolsky: Joel on Software. APress, 2004, 1590593898.
Joel on Software collects some of the witty, insightful articles Spolsky has blogged over the past few years. His observations on hiring programmers, measuring how well a development team is doing its job, the API wars, and other topics are always entertaining and informative. Over the course of forty-five short chapters, he ranges from the specific to the general and back again, tossing out pithy observations on the commoditization of the operating system, why you need to hire more testers, and why NIH (the not-invented-here syndrome) isn't necessarily a bad thing.
[Thompson & Chase 2005]Herbert H. Thompson and Scott G. Chase: The Software Vulnerability Guide. Charles River Media, 2005, 1584503580.
My current favorite guide to computer security for programmers, this book walks through each major family of security holes in turn: faulty permission models, bad passwords, macros, dynamic linking and loading, buffer overflow, format strings and various injection attacks, temporary files, spoofing, and more.
[Ullman & Liyanage 2004]Larry Ullman and Marc Liyanage: C Programming. Peachpit Press, 2004, 0321287630.
A gentle introduction to C, with many examples.
[Williams & Kessler 2003]Laurie Williams and Robert Kessler: Pair Programming Illuminated. Addison-Wesley, 2003, 0201745763.
A combination of an instruction manual, a summary of the authors' empirical studies of pair programming's effectiveness, and advocacy, this book is the reference guide for anyone who wants to introduce pair programming into their development team.
[Wilson 2005]Greg Wilson: Data Crunching. Pragmatic Bookshelf, 2005, 0974514071.
Every day, all around the world, programmers have to recycle legacy data, translate from one vendor's proprietary format into another's, check that configuration files are internally consistent, and search through web logs to see how many people have downloaded the latest release of their product. It may not be glamorous, but knowing how to do it efficiently is essential to being a good programmer. This book describes the most useful data crunching techniques, explains when you should use them, and shows how they will make your life easier.

Glossary

A

absolute path:
Yuan Gan
abstract data types:
access control:
access control lists:
acquire a lock:
action:
actors:
aggregate:
alias:
anchor (in regular expression):
annotated syntax tree:
assertion:
asymmetric cipher:
atomic:
Abhishek Ranjan
attribute (in XML):
authentication:
authorization:
automatic variables (in Make):

B

basic authentication:
binary data:
binary milestone:
binary mode:
blacklist:
Jeremy Hussell
boilerplate:
boundary object:
branch:
breakpoint:
Abhishek Ranjan
breakpoint, conditional:
bug tracker:
Bugzilla:

C

call stack:
callback:
camel case:
chain:
child class:
chunk:
cipher:
ciphertext:
class:
class diagram:
Simona Mindy
client:
code browser:
code inspection:
code review:
cognitive dissonance:
collision:
comment spam:
Alireza Moayerzadeh
commit:
Common Gateway Interface:
component object model:
component testing:
concurrency:
connection:
constructor:
control object:
cookie:
Coordinated Universal Time:
core dump:
Nilesh Bansal
cross product:
Amit Chandel
cross-site scripting:
CSV:
cursor:
CVS:

D

dashboard:
data member:
data modeling:
Lei Jiang
database column:
database key:
Amit Chandel
database row:
database table:
Amit Chandel
daylight savings time:
dead code:
deadlock:
Yuan Gan
debuggee:
decryption:
default target:
defense in depth:
defensive programming:
derive:
design pattern:
Lei Jiang
development process:
Yuan Gan
dictionary:
dictionary key:
directed graph:
directory tree:
docstrings:
document:
Document Object Model:
domain model:
Lei Jiang
domain model diagram:
drive:
driver:

E

element (in XML):
embed:
encryption:
entity object:
environment variable:
Nilton Bila
epoch:
Nilton Bila
error:
escape sequence:
event-driven programming:
exception:
executable documentation:
exponent:
extend:
Extreme Programming:
Tingting Zou

F

fail:
filename extension:
filter:
finite state machine:
fixture:
flag:
foreign key:
form:

G

garbage collect:
Nilesh Bansal
getter:
group:
greedy matching:

H

heap:
Abhishek Ranjan
heisenbug:
hexadecimal:
hijack:
host address:
HTTP header:
Nilesh Bansal

I

ICONIX process:
idiom:
immutable:
in-place operator:
inheritance:
instance:
instruction pointer:
instrument:
Integrated Development Environment:
integration testing:
integrity constraint:
internet protocol:
Nilesh Bansal
invert:
issue tracker:

J

join:
Amit Chandel

K

L

leap second:
leap year:
Liskov Substitution Principle:
little-endian:
Alireza Moayerzadeh
local time:
lock:
logging:

M

macro:
mailing list:
Make:
mantissa:
match object:
memory model:
merge:
method:
Nilton Bila
milestone:
module:
multi-valued assignment:
multiplicity:
Multipurpose Internet Mail Extensions:
mutable:

N

nested query:

O

object:
Lei Jiang
object-oriented analysis and design (OOAD):
Tingting Zou
operating system:
Nikos Sarkas
operator overloading:
Nilton Bila
optimistic concurrency:
Tingting Zou
outcome, actual:
outcome, expected:
override:

P

pack:
pair programming:
parent class:
parent directory:
pass:
path:
pattern rule:
pessimistic concurrency:
Tingting Zou
phony target:
pipe:
Abhishek Ranjan
plaintext:
polymorphism:
port:
post mortem:
post mortem debugging:
post-condition:
pre-condition:
prerequisite:
pretty-printer:
private key:
process:
profiling:
program slice:
project velocity:
protocol:
public key:
public key cryptography:
publish-subscribe:

Q

R

race condition:
raise exception:
record:
Amit Chandel
refactor:
reference count:
reflection:
register:
regression:
regression testing:
regular expression:
Abhishek Ranjan
relational database:
Yuan Gan
relative path:
Yuan Gan
release a lock:
release log:
A file (often a spreadsheet) that records exactly what was shipped to whom, when.
reluctant matching:
remote procedure call:
replay attack:
repository:
repository browser:
risk assessment:
roadmap:
robustness diagram:
role:
root directory:
Nilton Bila
RSS:
Alireza Moayerzadeh
rule:

S

sample:
scaffolding:
screen scraping:
search path:
seek:
sequence:
sequence diagram:
server:
Nilesh Bansal
setter:
shared library:
shell:
short-circuit evaluation:
Simple API for XML:
Alireza Moayerzadeh
single-step:
slice:
social engineering:
socket:
SourceForge:
Nikos Sarkas
sparse:
spawn:
special method:
spiral model:
SQL:
SQL injection:
stack frame:
stack pointer:
standard error:
standard input:
standard output:
state assertion:
static space:
status code:
step into:
step over:
submodule:
Subversion:
suspended process:
SWIG:
symbolic debugger:
symmetric cipher:
system testing:

T

tag (in XML):
target:
target program:
template:
test:
test result:
test suite:
test-driven development (TDD):
Alireza Moayerzadeh
testing, bottom-up:
text:
three-tier architecture:
ticket:
ticket, closed:
ticket, open:
time structure:
time zone:
timeline:
traceability:
transaction:
Nikos Sarkas
Transmission Control Protocol (TCP):
Nikos Sarkas
triage:
tuple:
two's complement:

U

Unicode:
Nikos Sarkas
Unified Modeling Language (UML):
Lei Jiang
unit testing:
unpack:
URL encode:
use case:
Tingting Zou
use case diagram:
Simona Mindy
user stories:
Simona Mindy

V

validate:
verifiable deliverable:
version control system:
version number:
Large projects typically use multi-part version numbers to identify software releases. In extreme cases, “6.2.3.1407” means major version 6, minor version 2, patch 3, build 1407. The major version number is only incremented when significant changes are made, where “significant” means “changes that make this version's data/configuration/whatever impossible for older versions to read”. Minor version numbers are what most people think of as releases. If you've added a few new features, changed part of the GUI, etc., you increment the minor version number and throw it to customers. Patches are things that don't have their own installers. If, for example, you need to change one HTML form, or one DLL, you will often just mail that out to customers, along with instructions about where to put it, rather than creating a new installer. You should still give it a number, though, and make an entry in your release log. Finally, the build number is incremented every time you create a new version of the product for QA to test. Build numbers are never reset, i.e. you don't go from 5.2.2.1001 to 6.0.0.0, but from 5.2.2.1001 to 6.0.0.1002, and so on. Build numbers are what developers care about: they're often only matched up with version numbers after the fact (i.e. you create build #1017, QA says, “Yeah, it looks good,” so you say, “All right, this'll be 6.1.0,” and voila, you have 6.1.0.1017.) Finally, groups will sometimes identify pre-releases as “beta 1”, “beta 2”, and so on, as in "6.2 beta 2". Again, this label is usually attached to a particular build after the fact—you wait until QA (or whoever) says that build #1017 is good enough to send out to customers, then tag it in version control.
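The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration, not a standard library or tool: the Version class and its method names are hypothetical, and it makes the simplifying assumption that every major, minor, or patch bump coincides with exactly one new build, whereas in practice the label is usually attached to a build after the fact.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True, order=True)
class Version:
    """Hypothetical four-part version: major.minor.patch.build."""
    major: int
    minor: int
    patch: int
    build: int

    @classmethod
    def parse(cls, text):
        """Parse a string such as '6.2.3.1407' into its four parts."""
        parts = text.split(".")
        if len(parts) != 4:
            raise ValueError("expected major.minor.patch.build, got %r" % text)
        return cls(*[int(p) for p in parts])

    def __str__(self):
        return "%d.%d.%d.%d" % (self.major, self.minor, self.patch, self.build)

    def next_build(self):
        """A new build for QA: only the build number changes."""
        return replace(self, build=self.build + 1)

    def next_patch(self):
        """A patch release (assumed here to be one new build)."""
        return replace(self, patch=self.patch + 1, build=self.build + 1)

    def next_minor(self):
        """A minor release: patch goes back to 0, build keeps counting."""
        return replace(self, minor=self.minor + 1, patch=0,
                       build=self.build + 1)

    def next_major(self):
        """A major release: minor and patch go back to 0, but the
        build number is never reset."""
        return replace(self, major=self.major + 1, minor=0, patch=0,
                       build=self.build + 1)


current = Version.parse("5.2.2.1001")
print(current.next_major())  # 6.0.0.1002, not 6.0.0.0
```

Because the build number only ever increases, it identifies a build uniquely even when the other three parts are assigned later.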
virtual machine:
visitor:

W

watchpoint:
waterfall model:
Simona Mindy
web services:
web spider:
Simona Mindy
weblog:
Jeremy Hussell
whitelist:
Jeremy Hussell
wiki:
Jeremy Hussell
wildcard:
Jeremy Hussell
working copy:

X

Y

Z

Online Resources

List of Figures

List of Tables

Syllabus

Software Carpentry

Introduction

Version Control

Shell Basics

More Shell

Basic Scripting

Strings, Lists, and Files

Functions, Libraries, and the File System

Testing Basics

Dictionaries and Error Handling

Debugging

Object-Oriented Programming

Structured Unit Testing

Automated Builds

Coding Style and Reading Code

Watching Programs Run

Regular Expressions

Basic XML and XHTML

A Mini-Project

Binary Data

Relational Databases

Client-Side Web Programming

CGI

Security

Teamware

Extreme Programming

The ICONIX Process

The Nevex Process

Backward, Forward, and Sideways