Empirical analysis algorithm example


Empirical analysis. If the running time of our program (approximately) obeys a power law T(n) ~ a n^b, we can use a doubling hypothesis to estimate the coefficients a and b.

Tilde notation. We say that f(n) ~ g(n) if f(n)/g(n) converges to 1 as n gets large. This is a general statement about mathematical functions and is not restricted to running time, memory, or any other specific domain.

Cost model. For theoretical analyses of running time in COS 226, we will assume a cost model, namely that some particular operation (or operations) dominates the running time of a program. We then express the running time as the total number of that operation as a function of the input size. To simplify things, we usually give this frequency count in tilde notation.

Order of growth. If we have two functions f(n) and g(n), and f(n) ~ c g(n) for some constant c > 0, we say the order of growth of f(n) is g(n). Typically g(n) is one of the following functions: 1, log n, n, n log n, n^2, n^3, or 2^n.

Worst-case order of growth isn't everything. Just because one algorithm has a better order of growth than another does not mean that it is faster in practice. We will encounter some notable counterexamples, including quicksort vs. mergesort.

Memory analysis. Know how to calculate the memory utilization of a class with the 64-bit memory model from the textbook.

Theoretical and empirical analysis. Hypotheses generated through theoretical analysis (or guesswork like our power-law assumption) should be validated with data before being fully trusted.
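
To make the doubling hypothesis concrete, here is a minimal sketch in Java that times a workload at successive doublings of n and estimates b and a from the ratio of consecutive running times. The class name DoublingEstimate and the quadratic dummy workload inside timeTrial are assumptions for illustration only; in practice you would time whatever program you actually care about.

    public class DoublingEstimate {
        private static double sink;   // keeps the compiler from discarding the workload

        // Hypothetical workload: a simple quadratic loop standing in for the real program.
        private static double timeTrial(int n) {
            long start = System.nanoTime();
            double sum = 0.0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    sum += 1.0 / (i + j + 1);
            sink = sum;
            return (System.nanoTime() - start) / 1.0e9;   // elapsed time in seconds
        }

        public static void main(String[] args) {
            double prev = timeTrial(500);
            for (int n = 1000; n <= 16000; n += n) {               // double n each iteration
                double time = timeTrial(n);
                double b = Math.log(time / prev) / Math.log(2);    // T(2n)/T(n) = 2^b
                double a = time / Math.pow(n, b);                  // back out a from T(n) ~ a n^b
                System.out.printf("n = %6d   time = %8.3f   b ~ %4.2f   a ~ %.2e%n", n, time, b, a);
                prev = time;
            }
        }
    }

If the power law holds, the printed estimates of b settle down (near 2 for this quadratic workload, since the time roughly quadruples each time n doubles), and a stabilizes to the leading constant.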

Recommended Problems

C level

  1. Textbook 1.4.4
  2. Fall 2011 Midterm, #2

B level

  1. Textbook 1.4.5
  2. Spring 2012 Midterm, #1
  3. For each of the functions shown, give the best order of growth of the running time.

     (b) For each recurrence relation, what is the running time T(N) (use tilde notation)?

Answers

Algorithm 1: c N
Algorithm 2: c log N
Algorithm 3: c N log N
d. N^3 log N

A level

  1. The code below operates on bacterial genomes, each approximately 1 megabyte in size.
    int N = Integer.parseInt(args[0]);
    String[] genomes = new String[N];
    for (int i = 0; i < N; i++) {
        In gfile = new In("genomeFile" + i + ".txt");
        genomes[i] = gfile.readString();
    }
    for (int i = 1; i < N; i++) {
        for (int j = i; j > 0; j--) {
            if (genomes[j-1].length() > genomes[j].length())
                exch(genomes, j-1, j);
            else break;
        }
    }
  1. What is the theoretical order of growth of the worst-case running time as a function of N?
  2. A table of runtimes for the program above is given below. Approximate the empirical running time in tilde notation as a function of N. Do not leave your answer in terms of logarithms.
  3. Why might your answers to the previous two questions differ?
     N    Time (s)
     1    0.15
     2    0.14
     4    0.19
     8    0.41
    16    0.85
    32    1.66
    64    3.38
Answers

  1. N^2. Reading in the genomes takes linear time. The for loops are just insertion sort, which is N^2 in the worst case (where the break statement never executes).
  2. ~0.05N (OK if left in terms of a fraction). Once N is large enough, the time roughly doubles each time N doubles, so b ≈ 1; then a ≈ 3.38/64 ≈ 0.05.
  3. Multiple acceptable answers:
    1. The input may not be a worst-case input (e.g., the genomes may already be sorted by length).
    2. The time to read in 64 megabytes of genomes is much larger than the time to perform 64^2 string-length compares and 64^2 swaps of 8-byte references.
    3. Partial credit: N is too small / time is too short
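
For those who want to check the answer to question 2 numerically, below is a minimal sketch (not part of the original solution; the class name GenomeTimingFit is made up) that applies the doubling estimate to the measured times from the table.

    public class GenomeTimingFit {
        public static void main(String[] args) {
            int[]    n    = {    1,    2,    4,    8,   16,   32,   64 };
            double[] time = { 0.15, 0.14, 0.19, 0.41, 0.85, 1.66, 3.38 };

            // Estimate b from each consecutive pair of measurements: T(2N)/T(N) = 2^b.
            for (int i = 1; i < n.length; i++) {
                double b = Math.log(time[i] / time[i-1]) / Math.log(2);
                System.out.printf("N = %2d -> %2d   b ~ %5.2f%n", n[i-1], n[i], b);
            }

            // With b ~ 1, back out a from the largest measurement: T(N) ~ a N.
            double a = time[time.length - 1] / n[n.length - 1];
            System.out.printf("a ~ %.3f, so T(N) ~ %.2fN%n", a, a);
        }
    }

Only the larger values of N give b near 1; the first few rows are dominated by timing noise and fixed overhead, which is the same caution as the partial-credit answer above ("N is too small / time is too short").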