List of Tags

- javascript (210)
- c# (130)
- git (119)
- java (110)
- jquery (99)
- python (94)
- c++ (87)
- .net (85)
- android (71)
- html (58)
- c (51)
- iphone (50)
- css (48)
- version-control (44)
- objective-c (41)
- php (41)
- string (41)
- ios (32)
- performance (31)
- language-agnostic (30)
- sql (30)
- algorithm (29)
- database (27)
- vim (25)
- c++-faq (24)
- asp.net (23)
- ruby (23)
- arrays (23)
- eclipse (22)
- mysql (22)
- sql-server (22)
- xcode (20)
- svn (20)
- windows (19)
- json (17)
- ruby-on-rails (17)
- cocoa-touch (17)
- asp.net-mvc (17)
- security (16)
- url (16)
- bash (16)
- ajax (15)
- ide (14)
- operators (14)
- node.js (13)
- coding-style (13)
- unit-testing (13)
- http (13)
- command-line (13)
- enums (12)
- gui (12)
- functional-programming (12)
- regex (12)
- visual-studio (12)
- cocoa (12)
- shell (12)
- list (11)
- branch (11)
- datetime (11)
- hidden-features (11)
- tsql (11)
- unix (11)
- optimization (11)
- linq (10)
- design-patterns (10)
- xml (10)
- random (10)
- scala (10)
- merge (10)
- math (10)
- oop (10)
- editor (10)
- sorting (10)
- dom (9)
- rest (9)
- polls (9)
- sqlite (9)
- browser (9)
- html5 (9)
- function (8)
- dvcs (8)
- lambda (8)
- github (8)
- syntax (8)
- memory-management (8)
- image (8)
- xcode4 (8)
- books (8)
- linux (8)
- wpf (8)
- vi (8)
- multithreading (8)
- osx (8)
- collections (7)
- generics (7)
- constructor (7)
- git-branch (7)
- haskell (7)
- pointers (7)
- open-source (7)
- terminology (7)
- javascript-events (7)
- tips-and-tricks (7)
- cross-browser (6)
- mercurial (6)
- entity-framework (6)
- forms (6)
- database-design (6)
- django (6)
- console (6)
- casting (6)
- user-interface (6)
- ipad (6)
- iframe (6)
- r (6)
- .net-4.0 (6)
- file (6)
- design (6)
- search (6)
- language-features (6)
- date (6)
- vb.net (6)
- dictionary (6)
- email (5)
- git-commit (5)
- gitignore (5)
- layout (5)
- google-chrome (5)
- closures (5)
- private (5)
- reference (5)
- data-structures (5)
- google (5)
- theory (5)
- big-o (5)
- singleton (5)
- exception (5)
- passwords (5)
- interface (5)
- programming-languages (5)
- soap (5)
- serialization (5)
- resources (5)
- initialization (5)
- variables (5)
- asynchronous (5)
- grep (5)
- productivity (5)
- floating-point (5)
- join (5)

What is a plain English explanation of Big O? With as little formal definition as possible and simple mathematics.

Answered By: cletus ( 2272)

Quick note, this is almost certainly confusing Big O notation (which is an upper bound) with Theta notation (which is a two-side bound). In my experience this is actually typical of discussions in non-academic settings. Apologies for any confusion caused.

The simplest definition I can give for Big-O notation is this:

**Big-O notation is a relative representation of the complexity of an algorithm.**

There are some important and deliberately chosen words in that sentence:

- **relative:** you can only compare apples to apples. You can't compare an algorithm to do arithmetic multiplication to an algorithm that sorts a list of integers. But two algorithms that do arithmetic operations (one multiplication, one addition) will tell you something meaningful;
- **representation:** Big-O (in its simplest form) reduces the comparison between algorithms to a single variable. That variable is chosen based on observations or assumptions. For example, sorting algorithms are typically compared based on comparison operations (comparing two nodes to determine their relative ordering). This assumes that comparison is expensive. But what if comparison is cheap but swapping is expensive? It changes the comparison; and
- **complexity:** if it takes me one second to sort 10,000 elements, how long will it take me to sort one million? Complexity in this instance is a relative measure to something else.

Come back and reread the above when you've read the rest.

The best example of Big-O I can think of is doing arithmetic. Take two numbers (123456 and 789012). The basic arithmetic operations we learnt in school were:

- addition;
- subtraction;
- multiplication; and
- division.

Each of these is an operation or a problem. A method of solving these is called an **algorithm**.

Addition is the simplest. You line the numbers up (to the right) and add the digits in a column writing the last number of that addition in the result. The 'tens' part of that number is carried over to the next column.

Let's assume that the addition of these numbers is the most expensive operation in this algorithm. It stands to reason that to add these two numbers together we have to add together 6 digits (and possibly carry a 7th). If we add two 100 digit numbers together we have to do 100 additions. If we add two 10,000 digit numbers we have to do 10,000 additions.

See the pattern? The **complexity** (being the number of operations) is directly proportional to the number of digits *n* in the larger number. We call this **O(n)** or **linear complexity**.
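To make the counting concrete, here is a small sketch (in Python, not part of the original answer) of grade-school addition that tallies one column addition per digit:

```python
def add_digits(a, b):
    """Grade-school addition on digit strings; returns (sum, column ops)."""
    a, b = a.zfill(len(b)), b.zfill(len(a))  # pad to equal length
    result, carry, ops = [], 0, 0
    for da, db in zip(reversed(a), reversed(b)):
        ops += 1                         # one column addition per digit
        total = int(da) + int(db) + carry
        result.append(str(total % 10))   # keep the last digit
        carry = total // 10              # carry the 'tens' part
    if carry:
        result.append(str(carry))
    return ''.join(reversed(result)), ops

# Work is directly proportional to the number of digits: O(n)
print(add_digits('123456', '789012'))   # ('912468', 6)
```

Doubling the number of digits doubles the count of column operations, which is exactly the linear relationship described above.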

Subtraction is similar (except you may need to borrow instead of carry).

Multiplication is different. You line the numbers up, take the first digit in the bottom number and multiply it in turn against each digit in the top number and so on through each digit. So to multiply our two 6 digit numbers we must do 36 multiplications. We may need to do as many as 10 or 11 column adds to get the end result too.

If we have two 100-digit numbers we need to do 10,000 multiplications and 200 adds. For two one million digit numbers we need to do one trillion (10^{12}) multiplications and two million adds.

As the algorithm scales with n-*squared*, this is **O(n^{2})** or **quadratic complexity**.

**We only care about the most significant portion of complexity.**

The astute may have realized that we could express the number of operations as: n^{2} + 2n. But as you saw from our example with two numbers of a million digits apiece, the second term (2n) becomes insignificant (accounting for 0.0002% of the total operations by that stage).

**The Telephone Book**

The next best example I can think of is the telephone book, normally called the White Pages or similar but it'll vary from country to country. But I'm talking about the one that lists people by surname and then initials or first name, possibly address and then telephone numbers.

Now if you were instructing a computer to look up the phone number for "John Smith" in a telephone book that contains 1,000,000 names, what would you do? Ignoring the fact that you could guess how far in the S's started (let's assume you can't), what would you do?

A typical implementation might be to open up to the middle, take the 500,000^{th} name and compare it to "Smith". If it happens to be "Smith, John", we just got really lucky. Far more likely is that "John Smith" will be before or after that name. If it's after, we then divide the last half of the phone book in half and repeat. If it's before, then we divide the first half of the phone book in half and repeat. And so on.

This is called a **binary search** and is used every day in programming whether you realize it or not.

So if you want to find a name in a phone book of a million names you can actually find any name by doing this at most 20 times. In comparing search algorithms we decide that this comparison is our 'n'.

For a phone book of 3 names it takes 2 comparisons (at most).

For 7 it takes at most 3.

For 15 it takes 4.

...

For 1,000,000 it takes 20.

That is staggeringly good isn't it?

In Big-O terms this is **O(log n)** or **logarithmic complexity**. Now the logarithm in question could be ln (base e), log_{10}, log_{2} or some other base. It doesn't matter it's still O(log n) just like O(2n^{2}) and O(100n^{2}) are still both O(n^{2}).
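The halving process above can be sketched as an ordinary binary search that counts its comparisons (an illustrative Python version, with the phone book standing in as a sorted list):

```python
import math

def binary_search(names, target):
    """Return (index or None, number of halving steps used)."""
    lo, hi, comparisons = 0, len(names) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1          # one "open the book in the middle" step
        if names[mid] == target:
            return mid, comparisons
        if names[mid] < target:
            lo = mid + 1          # look in the later half
        else:
            hi = mid - 1          # look in the earlier half
    return None, comparisons

# A million sorted entries: at most ~20 halvings, since 2^20 > 1,000,000
book = list(range(1_000_000))
_, comps = binary_search(book, 999_999)
print(comps, math.ceil(math.log2(1_000_000)))  # comparisons stay at or below 20
```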

It's worthwhile at this point to explain that Big O can be used to determine three cases with an algorithm:

- **Best Case:** In the telephone book search, the best case is that we find the name in one comparison. This is **O(1)** or **constant complexity**;
- **Expected Case:** As discussed above this is O(log n); and
- **Worst Case:** This is also O(log n).

Normally we don't care about the best case. We're interested in the expected and worst case. Sometimes one or the other of these will be more important.

Back to the telephone book.

What if you have a phone number and want to find a name? The police have a reverse phone book but such lookups are denied to the general public. Or are they? Technically you can reverse lookup a number in an ordinary phone book. How?

You start at the first name and compare the number. If it's a match, great, if not, you move on to the next. You have to do it this way because the phone book is **unordered** (by phone number anyway).

So to find a name:

- **Best Case:** O(1);
- **Expected Case:** O(n) (for 500,000); and
- **Worst Case:** O(n) (for 1,000,000).

**The Travelling Salesman**

This is quite a famous problem in computer science and deserves a mention. In this problem you have N towns. Each of those towns is linked to 1 or more other towns by a road of a certain distance. The Travelling Salesman problem is to find the shortest tour that visits every town.

Sounds simple? Think again.

If you have 3 towns A, B and C with roads between all pairs then you could go:

A -> B -> C

A -> C -> B

B -> C -> A

B -> A -> C

C -> A -> B

C -> B -> A

Well, actually there are fewer than that because some of these are equivalent (A -> B -> C and C -> B -> A are equivalent, for example, because they use the same roads, just in reverse).

In actuality there are 3 possibilities.

Take this to 4 towns and you have (iirc) 12 possibilities. With 5 it's 60. 6 becomes 360.

This is a function of a mathematical operation called a **factorial**. Basically:

```
5! = 5 * 4 * 3 * 2 * 1 = 120
6! = 6 * 5 * 4 * 3 * 2 * 1 = 720
7! = 7 * 6 * 5 * 4 * 3 * 2 * 1 = 5040
...
25! = 25 * 24 * ... * 2 * 1 = 15,511,210,043,330,985,984,000,000
...
50! = 50 * 49 * ... * 2 * 1 = 3.04140932... × 10^64
```

So the Big-O of the Travelling Salesman problem is **O(n!)** or **factorial or combinatorial complexity**.
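The tour counts above (3, 12, 60, 360) can be checked by brute force; this illustrative sketch (not from the original answer) enumerates the permutations and collapses each route with its reverse:

```python
from itertools import permutations
from math import factorial

def distinct_tours(n):
    """Enumerate open tours over n towns, treating a route and its
    reverse as the same tour (they use the same roads)."""
    towns = range(n)
    seen = set()
    for route in permutations(towns):
        seen.add(min(route, route[::-1]))  # one canonical form per pair
    return len(seen)

for n in range(3, 7):
    print(n, distinct_tours(n), factorial(n) // 2)  # 3, 12, 60, 360
```

The counts match n!/2 exactly, which is why the problem blows up factorially: brute force at n = 25 already means sifting through more than 10^25 tours.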

**By the time you get to 200 towns there isn't enough time left in the universe to solve the problem with traditional computers.**

Something to think about.

**Polynomial Time**

Another point I wanted to make quick mention of is that any algorithm that has a complexity of **O(n ^{a})** is said to have **polynomial complexity**, or to be solvable in **polynomial time**.

Traditional computers can solve polynomial-time problems. Certain things are used in the world because of this. Public Key Cryptography is a prime example. It is computationally hard to find the two prime factors of a very large number. If it wasn't, we couldn't use the public key systems we use.
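As a rough illustration of why factoring is hard (a sketch using naive trial division, not how real systems attack the problem), the work below grows with the square root of the number, which is exponential in its digit count:

```python
def trial_divide(n):
    """Naive factoring: up to ~sqrt(n) divisions.

    sqrt(n) is about 10^(d/2) for a d-digit number, so each extra
    digit multiplies the work; multiplying the primes back together
    stays cheap. That asymmetry is what public-key systems lean on."""
    d, steps = 2, 0
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return d, n // d, steps
        d += 1
    return n, 1, steps  # no divisor found: n is prime

# Cheap direction: 104729 * 104723. Hard direction: recovering the factors.
print(trial_divide(104729 * 104723))
```

Real factoring algorithms are much cleverer than this, but as far as anyone knows they are still superpolynomial on classical computers.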

Anyway, that's it for my (hopefully plain English) explanation of Big O (revised).

Yesterday I was pairing the socks from the clean laundry, and figured out the way I was doing it is not very efficient. I was doing a naive search — picking one sock and "iterating" the pile in order to find its pair. This requires iterating over n/2 * n/4 = n^{2}/8 socks on average.

As a computer scientist, I was thinking: what could I do? Sorting (according to size/color/...) of course came to mind to achieve an O(NlogN) solution.

Hashing or other not-in-place solutions are not an option, because I am not able to duplicate my socks (though it could be nice if I could).

**So, the question is basically:**

Given a pile of `n` pairs of socks, containing `2n` elements (assume each sock has exactly one matching pair), what is the best way to pair them up efficiently with up to logarithmic extra space? (I believe I can remember that amount of info if needed.)

I will appreciate an answer that addresses the following aspects:

- A general *theoretical* solution for a huge number of socks.
- The actual number of socks is not that large; I don't believe my spouse and I have more than 30 pairs. (And it is fairly easy to distinguish between my socks and hers; can this be utilized as well?)
- Is it equivalent to the element distinctness problem?

Answered By: usr ( 613)

Sorting solutions have been proposed but **sorting is a little too much**: We don't need order, **we just need equality groups**.

So **hashing** would be enough (and faster).

- For each color of socks, **form a pile**. Iterate over all socks in your input basket **and distribute them onto the color piles**.
- Iterate over each pile and **distribute it by some other metric** (e.g. pattern) into a second set of piles.
- **Recursively apply this scheme** until you have distributed all socks onto **very small piles that you can visually process immediately**.

This kind of recursive hash partitioning is actually being done by **SQL Server** when it needs to hash join or hash aggregate over huge data sets. It distributes its build input stream into many partitions which are independent. This scheme scales to arbitrary amounts of data and multiple CPUs linearly.

You don't need recursive partitioning if you can find a distribution key (hash key) that **provides enough buckets** that each bucket is small enough to be processed very quickly. Unfortunately, I don't think socks have such a property.

If each sock had an integer called "PairID" one could easily distribute them into 10 buckets according to `PairID % 10` (the last digit).

The best real-world partitioning I can think of is creating a **rectangle of piles**: one dimension is color, the other is pattern. Why a rectangle? Because we need O(1) random-access to piles. (A 3D cuboid would also work, but that is not very practical.)
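The one-step variant of this partitioning can be sketched as follows (an illustrative Python version with invented sock attributes; it assumes a (color, pattern) key uniquely identifies each pair, i.e. acts as a perfect hash):

```python
from collections import defaultdict

def pair_socks(socks):
    """One distribution pass: bucket by all attributes at once
    (the 'rectangle of piles' collapsed into a single hash key),
    then pair up socks within each bucket."""
    piles = defaultdict(list)
    for sock in socks:
        piles[(sock['color'], sock['pattern'])].append(sock)
    pairs = []
    for pile in piles.values():
        while len(pile) >= 2:              # each bucket holds whole pairs
            pairs.append((pile.pop(), pile.pop()))
    return pairs

laundry = [{'color': c, 'pattern': p, 'id': i}
           for i, (c, p) in enumerate([('red', 'dots'), ('red', 'dots'),
                                       ('blue', 'plain'), ('blue', 'plain')])]
print(len(pair_socks(laundry)))  # 2
```

Each sock is touched a constant number of times, giving the O(N) bound discussed below.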

Update:

What about **parallelism**? Can multiple humans match the socks faster?

- The simplest parallelization strategy is to have multiple workers take from the input basket and put the socks onto the piles. This only scales up so much - imagine 100 people fighting over 10 piles. **The synchronization costs** (manifesting themselves as hand-collisions and human communication) **destroy efficiency and speed-up** (see the Universal Scalability Law!).
- It scales nearly indefinitely if **each worker has its own set of piles**. Workers can then take big chunks of socks from the input basket (very little contention, as they do this rarely) and they do not need to sync at all when distributing the socks (because they have thread-local piles). At the end, all workers need to union their pile-sets. I believe that can be done in O(log (worker count * piles per worker)) if the workers form an **aggregation tree**.

What about the element distinctness problem? As the article states, the element distinctness problem can be solved in `O(N)`. This is the same for the socks problem (also `O(N)` if you need only one distribution step; I proposed multiple steps only because humans are bad at calculations - one step is enough if you distribute on `md5(color, length, pattern, ...)`, i.e. a **perfect hash** of all attributes).

Clearly, one cannot go faster than `O(N)`, so we have reached the **optimal lower bound**.

Although the outputs are not exactly the same (in one case, just a boolean. In the other case, the pairs of socks), the asymptotic complexities are the same.

I'm trying to create globally-unique identifiers in JavaScript. I'm not sure what routines are available on all browsers, how "random" and seeded the built-in random number generator is, etc.

The GUID / UUID should be at least 32 characters and should stay in the ASCII range to avoid trouble when passing them around.

Answered By: broofa ( 650)

For an RFC 4122 version 4 compliant solution, this one-liner(ish) solution is the most compact I could come up with:

```
'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
var r = Math.random()*16|0, v = c == 'x' ? r : (r&0x3|0x8);
return v.toString(16);
});
```

E.g:

```
>>> 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {var r = Math.random()*16|0,v=c=='x'?r:r&0x3|0x8;return v.toString(16);});
"3bce4931-6c75-41ab-afe0-2ec108a30860"
```
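As a quick sanity check (not part of the original answer), the generated strings can be validated against the shape RFC 4122 prescribes for version 4: a `4` in the version position and `8`, `9`, `a` or `b` in the variant position, exactly what the `r & 0x3 | 0x8` expression produces:

```python
import re

def looks_like_uuid4(s):
    """Check the shape RFC 4122 requires of a version-4 UUID string:
    lowercase hex groups 8-4-4-4-12, a '4' version nibble, and a
    variant nibble in {8, 9, a, b} (the 'y' position above)."""
    return re.fullmatch(
        r'[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}',
        s) is not None

print(looks_like_uuid4('3bce4931-6c75-41ab-afe0-2ec108a30860'))  # True
```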

One of the most interesting projects I've worked in the past couple years as I was still a student, was a final project about image processing. The goal was to develop a system to be able to recognize Coca-Cola **cans** (note that I'm stressing the word cans, you'll see why in a minute). You can see a sample below, with the can recognized in the green rectangle with scale and rotation.

Some constraints on the project:

- The background could be very noisy.
- The can could have any scale or rotation, or even orientation (within reasonable limits).
- The image could have some degree of fuzziness (contours might not be entirely straight).
- There could be Coca-Cola bottles in the image, and the algorithm should only detect the can!
- The brightness of the image could vary a lot (so you can't rely "too much" on color detection).
- The can could be partly hidden on the sides or the middle (and possibly partly hidden behind the bottle!).
- There could be no cans at all in the image, in which case you had to find nothing and write a message saying so.

So you could end up with tricky things like this (which in this case had my algorithm totally fail):

Now, this project was obviously done a while ago, but I had a lot of fun doing it, and I ended up with a decent implementation. Here are some details about it:

**Language**: Done in C++ using OpenCV library.

**Pre-processing**: Regarding image pre-processing I mean how to transform it in a more raw form to give to the algorithm. I used 2 methods:

- Changing color domain from RGB to HSV (Hue Saturation Value) and filtering based on "red" hue, saturation above a certain threshold to avoid orange-like colors, and filtering of low value to avoid dark tones. The end result was a binary black-and-white image, where all white pixels represent the pixels that match this threshold. Obviously there is still a lot of crap in the image, but this reduces the number of dimensions you have to work with.
- Noise filtering using median filtering (taking the median pixel value of all neighbors and replacing the pixel by this value) to reduce noise.
- Using a Canny Edge Detection filter to get the contours of all items after the 2 preceding steps.

**Algorithm**: The algorithm itself I chose for this task was taken from this (awesome) book on feature extraction and called Generalized Hough Transform (pretty different from the regular Hough Transform). It basically says a few things:

- You can describe an object in space without knowing its analytical equation (which is the case here).
- It is resistant to image deformations such as scaling and rotation, as it will basically test your image for every combination of scale factor and rotation factor.
- It uses a base model (a template) that the algorithm will "learn".
- Each pixel remaining in the contour image will vote for another pixel which will supposedly be the center (in terms of gravity) of your object, based on what it learned from the model.

In the end, you end up with a heat map of the votes, for example here all the pixels of the contour of the can will vote for its gravitational center, so you'll have a lot of votes in the same pixel corresponding to the center, and will see a peak in the heat map as below.

Once you have that, a simple threshold-based heuristic can give you the location of the center pixel, from which you can derive the scale and rotation and then plot your little rectangle around it (final scale and rotation factor will obviously be relative to your original template). In theory at least...
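The voting idea can be sketched with a toy example (a deliberate simplification of the real Generalized Hough Transform, which indexes template offsets by edge gradient direction rather than trying all of them; the points and offsets below are invented):

```python
def vote_for_center(contour_points, offsets, shape):
    """Toy Hough-style voting: each contour point casts a vote at
    point + offset for every stored template offset; the accumulator
    peak is the most plausible object centre."""
    votes = {}
    h, w = shape
    for (y, x) in contour_points:
        for (dy, dx) in offsets:
            cy, cx = y + dy, x + dx
            if 0 <= cy < h and 0 <= cx < w:
                votes[(cy, cx)] = votes.get((cy, cx), 0) + 1
    return max(votes, key=votes.get)   # the peak of the "heat map"

# A square contour whose points all sit 2 pixels from the centre (5, 5):
contour = [(3, 5), (7, 5), (5, 3), (5, 7)]
offsets = [(2, 0), (-2, 0), (0, 2), (0, -2)]   # learned from the template
print(vote_for_center(contour, offsets, (10, 10)))  # (5, 5)
```

Only the true centre collects a vote from every contour point, which is why it stands out as a peak even though individual votes are scattered.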

**Results**: Now, while this approach worked in the basic cases, it was severely lacking in some areas:

- It is **extremely slow**! I can't stress this enough. Almost a full day was needed to process the 30 test images, obviously because I had a very high scaling factor for rotation and translation, since some of the cans were very small.
- It was completely lost when bottles were in the image, and for some reason almost always found the bottle instead of the can (perhaps because bottles were bigger, thus had more pixels, thus more votes).
- Fuzzy images were also no good, since the votes ended up in pixels at random locations around the center, ending with a very noisy heat map.
- Invariance in translation and rotation was achieved, but not in orientation, meaning that a can that was not directly facing the camera objective wasn't recognized.

Can you help me improve my **specific** algorithm, using **exclusively** OpenCV features, to resolve the **four specific** issues mentioned?

I hope some people will learn something from this as well; after all, I think not only the people who ask questions should learn. :)

Answered By: stacker ( 133)

An alternative approach would be to extract features (keypoints) using the scale-invariant feature transform (SIFT) or Speeded Up Robust Features (SURF).

It is implemented in OpenCV 2.3.1.

You can find a nice code example using features in *Features2D + Homography to find a known object*

Both algorithms are invariant to scaling and rotation. Since they work with features, you can also handle occlusion (as long as enough keypoints are visible).

Image source: tutorial example

The processing takes a few hundred ms for SIFT; SURF is a bit faster, but it is not suitable for real-time applications. ORB uses FAST, which is weaker regarding rotation invariance.

I have a computer with 1M of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection. The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1M. I already have code to drive the ethernet port and handle TCP/IP connections, and it requires 2k for its state data, including a 1k buffer via which the code will read and write data. Is there a solution to this problem?

**Sources Of Question And Answer:**

http://tech.slashdot.org/comments.pl?sid=232757&cid=18925745

http://nick.cleaton.net/ramsort.html

Answered By: preshing ( 330)

Here's some working C++ code which solves the problem.

Proof that the memory constraints are satisfied:

```
typedef unsigned int u32;
namespace WorkArea
{
    static const u32 circularSize = 253250;
    u32 circular[circularSize] = { 0 }; // consumes 1013000 bytes
    static const u32 stageSize = 8000;
    u32 stage[stageSize];               // consumes 32000 bytes
    ...
```

Together, these two arrays take 1045000 bytes of storage. That leaves 1048576 - 1045000 - 2×1024 = 1528 bytes for remaining variables and stack space.

It runs in about 23 seconds on my Xeon W3520. You can verify that the program works using the following Python script, assuming a program name of `sort1mb.exe`.

```
from subprocess import *
import random
sequence = [random.randint(0, 99999999) for i in xrange(1000000)]
sorter = Popen('sort1mb.exe', stdin=PIPE, stdout=PIPE)
for value in sequence:
    sorter.stdin.write('%08d\n' % value)
sorter.stdin.close()
result = [int(line) for line in sorter.stdout]
print('OK!' if result == sorted(sequence) else 'Error!')
```

A detailed explanation of the algorithm can be found in the following series of posts:

It is an interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.

My analysis:

The size of the file is 4×10^{9}×4 bytes = 16 GB.

We can do external sorting, thus we get to know the range of the integers. My question is what is the best way to detect the missing integer in the sorted big integer sets?

My understanding (after reading all answers):

Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4 × 10^{9} distinct integers.

Case 1: we have 1 GB = 1 × 10^{9} × 8 bits = 8 billion bits of memory.
Solution: if we use one bit to represent one distinct integer, it is enough; we don't need to sort.
Implementation:

```
int radix = 8;
// 2^32 bits / 8 bits per byte = 512 MB (0xffffffff/radix would overflow to
// a negative array size in Java)
byte[] bitfield = new byte[1 << 29];
// requires java.util.Scanner and java.io.FileReader
void F() throws FileNotFoundException {
    Scanner in = new Scanner(new FileReader("a.txt"));
    while (in.hasNextInt()) {
        int n = in.nextInt();
        bitfield[n/radix] |= (1 << (n%radix));
    }
    for (int i = 0; i < bitfield.length; i++) {
        for (int j = 0; j < radix; j++) {
            if ((bitfield[i] & (1 << j)) == 0)
                System.out.println(i*radix + j);
        }
    }
}
```

Case 2: 10 MB memory = 10 * 10^{6} * 8 bits = 80 million bits

```
Solution: For all possible 16-bit prefixes, there are 2^16 = 65536 buckets;
a 4-byte counter per bucket needs 2^16 * 4 * 8 = 2 million bits. We need
4 bytes per counter because the worst case is that all 4 billion integers
belong to the same bucket.
step 1: Build the counter of each bucket through a first pass over the file.
step 2: Scan the counters, find the first bucket with fewer than 65536 hits.
step 3: Through a second pass over the file, build new buckets for the
        numbers whose high 16-bit prefix was found in step 2.
step 4: Scan the buckets built in step 3, find the first bucket which
        doesn't have a hit.
The code is very similar to the one above.
```

Conclusion: we trade memory for additional passes over the file.
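The 10 MB case can be sketched directly from the steps above (an illustrative Python version; the `read_file` callable standing in for one pass over the file is an invention of this sketch):

```python
def find_missing(read_file, bits=32, prefix_bits=16):
    """Two passes, O(2^16) counters of memory.
    Pass 1: count numbers per 16-bit prefix; any bucket with fewer
    than 2^16 hits must be missing at least one suffix.
    Pass 2: mark the suffixes seen in that bucket; an unmarked
    suffix yields a number not in the file."""
    suffix_bits = bits - prefix_bits
    counts = [0] * (1 << prefix_bits)
    for n in read_file():                       # pass 1
        counts[n >> suffix_bits] += 1
    prefix = next(p for p, c in enumerate(counts) if c < (1 << suffix_bits))
    seen = bytearray(1 << suffix_bits)
    mask = (1 << suffix_bits) - 1
    for n in read_file():                       # pass 2
        if n >> suffix_bits == prefix:
            seen[n & mask] = 1
    suffix = next(s for s, hit in enumerate(seen) if not hit)
    return (prefix << suffix_bits) | suffix

numbers = [n for n in range(100_000) if n != 54_321]
print(find_missing(lambda: iter(numbers)))  # 54321
```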

*A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file -- at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing. Sorry.*

Answered By: Henning Makholm ( 345)

**Assuming that "integer" means 32 bits**: Having 10 MB of space is more than enough for you to count, in one pass through the input file, how many numbers there are with any given 16-bit prefix. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are already used.

**If it means more than 32 bits, but still of bounded size**: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

**If "integer" means mathematical integer**: Read through the input once and keep track of the ~~largest number~~ length of the longest number you've ever seen. When you're done, output ~~the maximum plus one~~ a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the *length* of anything that fits in it).

I was trying various methods to implement a program that gives the digits of pi sequentially. I tried the Taylor series method, but it proved to converge extremely slowly (when I compared my result with the online values after some time). Anyway, I am trying better algorithms.

So, while writing the program I got stuck on a problem, as with all algorithms: How do I know that the `n` digits that I've calculated are accurate?

Answered By: Mysticial ( 963)

Since I'm the current world record holder for the most digits of pi, I'll add my two cents:

Unless you're actually setting a new world record, the common practice is just to verify the computed digits against the known values. So that's simple enough.

In fact, I have a webpage that lists snippets of digits for the purpose of verifying computations against them: http://www.numberworld.org/digits/Pi/

But when you get into world-record territory, there's nothing to compare against.

Historically, the standard approach for verifying that computed digits are correct is to recompute the digits using a second algorithm. So if either computation goes bad, the digits at the end won't match.

This typically more than doubles the amount of time needed (since the second algorithm is usually slower). But it's the only way to verify the computed digits once you've wandered into the uncharted territory of never-before-computed digits and a new world record.

Back in the days where supercomputers were setting the records, two different AGM algorithms were commonly used:

These are both `O(N log(N)^2)` algorithms that were fairly easy to implement.

However, nowadays, things are a bit different. In the last three world records, instead of performing two computations, we performed only one computation using the fastest known formula (Chudnovsky Formula):

This algorithm is much harder to implement, but it is a lot faster than the AGM algorithms.

Then we verify the binary digits using the BBP formulas for digit extraction.

This formula allows you to compute arbitrary binary digits *without* computing all the digits before it. So it is used to verify the last few computed binary digits. Therefore it is **much** faster than a full computation.
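For reference, the BBP series itself is easy to state and sum. This sketch (not from the original answer) just sums the series as floating point; actual digit extraction applies modular arithmetic to the same terms:

```python
import math

def bbp_pi(terms):
    """Sum the first few terms of the Bailey-Borwein-Plouffe series:
    pi = sum_{k>=0} (1/16^k) * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)).
    Each term shrinks by a factor of 16, i.e. 4 binary digits per term."""
    total = 0.0
    for k in range(terms):
        total += (1 / 16 ** k) * (4 / (8 * k + 1) - 2 / (8 * k + 4)
                                  - 1 / (8 * k + 5) - 1 / (8 * k + 6))
    return total

print(abs(bbp_pi(12) - math.pi))  # already at double-precision agreement
```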

The advantage of this is:

- Only one expensive computation is needed.

The disadvantage is:

- An implementation of the Bailey–Borwein–Plouffe (BBP) formula is needed.
- An additional step is needed to verify the radix conversion from binary to decimal.

*I've glossed over some details of why verifying the last few digits implies that all the digits are correct. But it is easy to see this since any computation error will propagate to the last digits.*

Now this last step (verifying the conversion) is actually fairly important. One of the previous world record holders **actually called us out** on this because, initially, I didn't give a sufficient description of how it worked.

So I've pulled this snippet from my blog:

```
N = # of decimal digits desired
p = 64-bit prime number
```

Compute A using base 10 arithmetic and B using binary arithmetic.

If `A = B`, then with "extremely high probability", the conversion is correct.

For further reading, see my blog post **Pi - 5 Trillion Digits**.

Given a function which produces a random integer in the range 1 to 5, write a function which produces a random integer in the range 1 to 7.

- What is a simple solution?
- What is an effective solution to reduce memory usage or run on a slower CPU?

Answered By: Rob McAfee ( 265)

This is equivalent to Adam Rosenfield's solution, but may be a bit more clear for some readers. It assumes rand5() is a function that returns a statistically random integer in the range 1 through 5 inclusive.

```
int rand7()
{
int vals[5][5] = {
{ 1, 2, 3, 4, 5 },
{ 6, 7, 1, 2, 3 },
{ 4, 5, 6, 7, 1 },
{ 2, 3, 4, 5, 6 },
{ 7, 0, 0, 0, 0 }
};
int result = 0;
while (result == 0)
{
int i = rand5();
int j = rand5();
result = vals[i-1][j-1];
}
return result;
}
```

How does it work? Think of it like this: imagine printing out this double-dimension array on paper, tacking it up to a dart board and randomly throwing darts at it. If you hit a non-zero value, it's a statistically random value between 1 and 7, since there are an equal number of non-zero values to choose from. If you hit a zero, just keep throwing the dart until you hit a non-zero. That's what this code is doing: the i and j indexes randomly select a location on the dart board, and if we don't get a good result, we keep throwing darts.

Like Adam said, this can run forever in the worst case, but statistically the worst case never happens. :)
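The same rejection-sampling idea, without the lookup table, can be sketched as follows (an illustrative Python version, with `rand5` simulated via the standard library):

```python
import random

def rand5():
    return random.randint(1, 5)   # uniform on 1..5, standing in for the given function

def rand7():
    """Rejection sampling: two rand5 calls index 25 equally likely cells;
    keep 21 of them (three full cycles of 1..7) and retry on the other 4."""
    while True:
        cell = 5 * (rand5() - 1) + rand5()   # uniform on 1..25
        if cell <= 21:
            return (cell - 1) % 7 + 1        # 1..21 maps onto 1..7 three times

counts = {v: 0 for v in range(1, 8)}
for _ in range(70_000):
    counts[rand7()] += 1
print(min(counts.values()), max(counts.values()))  # each value near 10,000
```

Exactly four of the 25 cells are rejected, matching the four zeros on the dart board above.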

I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is *Fast String Searching With Suffix Trees*, but he glosses over various points and some aspects of the algorithm remain unclear.

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

My basic understanding, so far:

- I need to iterate through each prefix P of a given string T
- I need to iterate through each suffix S in prefix P and add that to tree
- To add suffix S to the tree, I need to iterate through each character in S, either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendant nodes when I reach a differing character in the suffix, or, if there is no matching edge to walk down, creating a new leaf edge for C.

The basic algorithm appears to be O(n²), as is pointed out in most explanations, since we need to step through all of the prefixes, and then through each of the suffixes of each prefix. Ukkonen's algorithm is apparently unique because of the suffix-pointer technique he uses, though I think *that* is what I'm having trouble understanding.

I'm also having trouble understanding:

- exactly when and how the "active point" is assigned, used and changed
- what is going on with the canonization aspect of the algorithm
- why the implementations I've seen need to "fix" bounding variables that they are using

**EDIT** (April 13, 2012)

Here is the completed source code that I've written and output based on jogojapan's answer below. The code outputs a detailed description and text-based diagram of the steps it takes as it builds the tree. It is a first version and could probably do with optimization and so forth, but it works, which is the main thing.

[Redacted URL, see updated link below]

**EDIT** (April 15, 2012)

The source code has been completely rewritten from scratch and now not only works correctly, but it supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output are at:

Answered By: jogojapan ( 890)

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

**First, a few preliminary statements.**

What we are building is *basically* like a search trie: there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth.

**But:** Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers `[from,to]`. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

```
abc
```

The algorithm **works in steps, from left to right**. There is **one step for every character of the string**. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

So, we start from the *left*, and first insert only the single character `a` by creating an edge from the root node (on the left) to a leaf, and labeling it as `[0,#]`, which means the edge represents the substring starting at position 0 and ending at *the current end*. I use the symbol `#` to mean *the current end*, which is at position 1 (right after `a`).

So we have an initial tree, which looks like this:

And what it means is this:

Now we progress to position 2 (right after `b`). **Our goal at each step** is to insert **all suffixes up to the current position**. We do this by

- expanding the existing `a`-edge to `ab`
- inserting one new edge for `b`

In our representation this looks like

And what it means is:

**We observe** two things:

- The edge representation for `ab` is **the same** as it used to be in the initial tree: `[0,#]`. Its meaning has automatically changed because we updated the current position `#` from 1 to 2.
- Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

Next we increment the position again and update the tree by appending a `c` to every existing edge and inserting one new edge for the new suffix `c`.

In our representation this looks like

And what it means is:

**We observe:**

- The tree is the correct suffix tree *up to the current position* after each step
- There are as many steps as there are characters in the text
- The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing `#`, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:

```
abcabxabcd
```

It starts with `abc` as in the previous example, then `ab` is repeated and followed by `x`, and then `abc` is repeated followed by `d`.

**Steps 1 through 3:** After the first 3 steps we have the tree from the previous example:

**Step 4:** We move `#` to position 4. This implicitly updates all existing edges to this:

and we need to insert the final suffix of the current step, `a`, at the root.

Before we do this, we introduce **two more variables** (in addition to `#`), which of course have been there all the time but we haven't used them so far:

- The **active point**, which is a triple `(active_node,active_edge,active_length)`
- The `remainder`, which is an integer indicating how many new suffixes we need to insert

The exact meaning of these two will become clear soon, but for now let's just say:

- In the simple `abc` example, the active point was always `(root,'\0x',0)`, i.e. `active_node` was the root node, `active_edge` was specified as the null character `'\0x'`, and `active_length` was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
- The `remainder` was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

Now this is going to change. When we insert the current final character `a` at the root, we notice that there is already an outgoing edge starting with `a`, specifically: `abca`. Here is what we do in such a case:

- We do **not** insert a fresh edge at the root node.
- Instead we simply notice that the suffix `a` is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
- We **set the active point** to `(root,'a',1)`. That means the active point is now somewhere in the middle of the outgoing edge of the root node that starts with `a`, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character `a`. That suffices because there can be *only one* edge starting with any particular character (confirm that this is true after reading through the entire description).
- We also increment `remainder`, so at the beginning of the next step it will be 2.

**Observation:** When the final **suffix we need to insert is found to exist in the tree already**, the tree itself is **not changed** at all (we only update the active point and `remainder`). The tree is then not an accurate representation of the suffix tree *up to the current position* any more, but it **contains** all suffixes (because the final suffix `a` is contained *implicitly*). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was **no work** done in this step.

**Step 5:** We update the current position `#` to 5. This automatically updates the tree to this:

And **because `remainder` is 2**, we need to insert two final suffixes of the current position: `ab` and `b`. This is basically because:

- The `a` suffix from the previous step has never been properly inserted. So it has *remained*, and since we have progressed one step, it has now grown from `a` to `ab`.
- And we need to insert the new final edge `b`.

In practice this means that we go to the active point (which points to behind the `a` on what is now the `abcab` edge), and insert the current final character `b`. **But:** Again, it turns out that `b` is also already present on that same edge.

So, again, we do not change the tree. We simply:

- Update the active point to `(root,'a',2)` (same node and edge as before, but now we point to behind the `b`)
- Increment the `remainder` to 3, because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.

To be clear: We had to insert `ab` and `b` in the current step, but because `ab` was already found, we updated the active point and did not even attempt to insert `b`. Why? Because if `ab` is in the tree, **every suffix** of it (including `b`) must be in the tree, too. Perhaps only *implicitly*, but it must be there, because of the way we have built the tree so far.

We proceed to **step 6** by incrementing `#`. The tree is automatically updated to:

Because **`remainder` is 3**, we have to insert `abx`, `bx` and `x`. The active point tells us where `ab` ends, so we only need to jump there and insert the `x`. Indeed, `x` is not there yet, so we split the `abcabx` edge and insert an internal node:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

So we have dealt with `abx` and decrement `remainder` to 2. Now we need to insert the next remaining suffix, `bx`. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called **Rule 1** below, and it applies whenever the `active_node` is root (we will learn rule 3 for other cases further below). Here is rule 1:

After an insertion from root,

- `active_node` remains root
- `active_edge` is set to the first character of the new suffix we need to insert, i.e. `b`
- `active_length` is reduced by 1

Hence, the new active-point triple `(root,'b',1)` indicates that the next insert has to be made at the `bcabx` edge, behind 1 character, i.e. behind `b`. We can identify the insertion point in O(1) time and check whether `x` is already present or not. If it was present, we would end the current step and leave everything the way it is. But `x` is not present, so we insert it by splitting the edge:

Again, this took O(1) time and we update `remainder` to 1 and the active point to `(root,'x',0)` as rule 1 states.

But there is one more thing we need to do. We'll call this **Rule 2:**

If we split an edge and insert a new node, and if that is **not the first node** created during the current step, we connect the previously inserted node and the new node through a special pointer, a **suffix link**. We will later see why that is useful. Here is what we get; the suffix link is represented as a dotted edge:

We still need to insert the final suffix of the current step, `x`. Since the `active_length` component of the active point has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with `x`, we insert a new edge:

As we can see, in the current step all remaining inserts were made.

We proceed to **step 7** by setting `#`=7, which automatically appends the next character, `a`, to all leaf edges, as always. Then we attempt to insert the new final character at the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to `(root,'a',1)`.

In **step 8**, `#`=8, we append `b`, and, as seen before, this only means we update the active point to `(root,'a',2)` and increment `remainder` without doing anything else, because `b` is already present. **However,** we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to `(node1,'\0x',0)`. Here, I use `node1` to refer to the internal node the `ab` edge ends at.

Then, in **step #=9**, we need to insert `c`, and this will help us to understand the final trick:

As always, the `#` update appends `c` automatically to the leaf edges and we go to the active point to see if we can insert `c`. It turns out `c` exists already at that edge, so we set the active point to `(node1,'c',1)`, increment `remainder` and do nothing else.

Now in **step #=10**, `remainder` is 4, and so we first need to insert `abcd` (which remains from 3 steps ago) by inserting `d` at the active point.

Attempting to insert `d` at the active point causes an edge split in O(1) time:

The `active_node`, from which the split was initiated, is marked in red above. Here is the final rule, **Rule 3:**

After splitting an edge from an `active_node` that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the `active_node` to the node it points to. If there is no suffix link, we set the `active_node` to the root. `active_edge` and `active_length` remain unchanged.

So the active point is now `(node2,'c',1)`, and `node2` is marked in red below:

Since the insertion of `abcd` is complete, we decrement `remainder` to 3 and consider the next remaining suffix of the current step, `bcd`. Rule 3 has set the active point to just the right node and edge, so inserting `bcd` can be done by simply inserting its final character `d` at the active point.

Doing this causes another edge split, and **because of rule 2**, we
must create a suffix link from the previously inserted node to the new
one:

**We observe:** Suffix links enable us to reset the active point so we can make the next *remaining insert* with O(1) effort. Look at the graph above to confirm that indeed the node at label `ab` is linked to the node at `b` (its suffix), and the node at `abc` is linked to `bc`.

The current step is not finished yet. `remainder` is now 2, and we need to follow rule 3 to reset the active point again. Since the current `active_node` (red above) has no suffix link, we reset to root. The active point is now `(root,'c',1)`.

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with `c`: `cabxabcd`, behind the first character, i.e. behind `c`. This causes another split:

And since this involves the creation of a new internal node, we follow rule 2 and set a new suffix link from the previously created internal node:

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

With this, `remainder` can be set to 1, and since the `active_node` is root, we use rule 1 to update the active point to `(root,'d',0)`. This means the final insert of the current step is to insert a single `d` at root:

That was the final step and we are done. There are a number of **final
observations**, though:

- In each step we move `#` forward by 1 position. This automatically updates all leaf nodes in O(1) time.
- But that does not deal with a) any suffixes *remaining* from previous steps, and b) the one final character of the current step.
- `remainder` tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position `#`. We consider one after the other and make the insert. **Important:** Each insert is done in O(1) time, since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are *contained implicitly* (otherwise the active point would not be where it is).
- After each such insert, we decrement `remainder` and follow the suffix link if there is any. If not, we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.
- If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if `remainder` > 0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all *implicit* in the current tree. The fact that `remainder` > 0 makes sure we deal with the remaining suffixes later.
- What if at the end of the algorithm `remainder` > 0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign `$` is used as a symbol for that. **Why does that matter?** If later we use the completed suffix tree to search for suffixes, we must accept matches only if they *end at a leaf*. Otherwise we would get a lot of spurious matches, because there are *many* strings *implicitly* contained in the tree that are not actual suffixes of the main string. Forcing `remainder` to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. **However,** if we want to use the tree to search for *general substrings*, not only *suffixes* of the main string, this final step is indeed not required, as suggested by the OP's comment below.
- So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make `remainder` inserts, each taking O(1) time. Since `remainder` indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).
- However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its `active_length` component does not work well with the new `active_node`. For example, consider a situation like this:

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

Now let the active point be `(red,'d',3)`, so it points to the place behind the `f` on the `defg` edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is `(green,'d',3)`. However, the `d`-edge going out of the green node is `de`, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to `(blue,'f',1)`.

In a particularly bad case, the `active_length` could be as large as `remainder`, which can be as large as n. And it might very well happen that to find the correct active point, we need to jump over not only one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n²) complexity, because in each step `remainder` is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and `active_length` will be reduced. As we follow down the chain of suffix links and make the remaining inserts, `active_length` can only decrease, and the number of active-point adjustments we can make on the way can't be larger than `active_length` at any given time. Since `active_length` can never be larger than `remainder`, and `remainder` is O(n) not only in every single step, but the total sum of increments ever made to `remainder` over the course of the entire process is O(n) too, the number of active-point adjustments is also bounded by O(n).

How do map providers (such as Google or Yahoo! Maps) suggest directions?

I mean, they probably have real-world data in some form, certainly including distances but also perhaps things like driving speeds, presence of sidewalks, train schedules, etc. But suppose the data were in a simpler format, say a very large directed graph with edge weights reflecting distances. I want to be able to quickly compute directions from one arbitrary point to another. Sometimes these points will be close together (within one city) while sometimes they will be far apart (cross-country).

Graph algorithms like Dijkstra's algorithm will not work because the graph is enormous. Luckily, heuristic algorithms like A* will probably work. However, our data is very structured, and perhaps some kind of tiered approach might work? (For example, store precomputed directions between certain "key" points far apart, as well as some local directions. Then directions for two far-away points will involve local directions to a key points, global directions to another key point, and then local directions again.)

What algorithms are actually used in practice?

PS. This question was motivated by finding quirks in online mapping directions. Contrary to the triangle inequality, sometimes Google Maps thinks that X-Z takes longer and is farther than using an intermediate point as in X-Y-Z. But maybe their walking directions optimize for another parameter, too?

PPS. Here's another violation of the triangle inequality that suggests (to me) that they use some kind of tiered approach: X-Z versus X-Y-Z. The former seems to use prominent Boulevard de Sebastopol even though it's slightly out of the way.

**Edit**: Neither of these examples seem to work anymore, but both did at the time of the original post.

Answered By: Nick Johnson ( 257)

Speaking as someone who spent 18 months working at a mapping company, which included working on the routing algorithm... yes, Dijkstra's does work, with a couple of modifications:

- Instead of doing Dijkstra's once from source to dest, you start at each end, and expand both sides until they meet in the middle. This eliminates roughly half the work (2*pi*(r/2)^2 vs pi*r^2).
- To avoid exploring the back-alleys of every city between your source and destination, you can have several layers of map data: A 'highways' layer that contains only highways, a 'secondary' layer that contains only secondary streets, and so forth. Then, you explore only smaller sections of the more detailed layers, expanding as necessary. Obviously this description leaves out a lot of detail, but you get the idea.

With modifications along those lines, you can do even cross-country routing in a very reasonable timeframe.