Top c Questions

List of Tags

After reading "Hidden Features and Dark Corners of C++/STL" on comp.lang.c++.moderated, I was completely surprised that it compiled and worked in both Visual Studio 2008 and G++ 4.4. The code:

#include <stdio.h>
int main()
     int x = 10;
     while( x --> 0 ) // x goes to 0
       printf("%d ", x);

Where in the standard is this defined, and where did it come from?

I'd assume C, since it works in GCC as well, but I put C++ on there just in case C++ has more to mention on it. On a more subjective note, I've never heard of this before, had anybody else? Is it worth using?

Answered By: Charles Salvia ( 1680)

That's not an operator -->. That's two separate operators, -- and >.

Your condition code is decrementing x, while returning xs original (not decremented) value, and then comparing the original value with 0 using the > operator.

To better understand, the statement could be as follows:

while( (x--) > 0 )

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

const int n=100000

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:

for(int j=0;j<n;j++){
    a1[j] += b1[j];
for(int j=0;j<n;j++){
    c1[j] += d1[j];

Compiled on MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on a Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: (Please refer to the my rephrased question at the bottom)

PS: I am not sure, if this helps:

Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

Each loop of the double loop example produces this code (the following block is repeated about three times):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

PPS: The full code is at It uses TBB Tick_Count for higher resolution timing, which can be disabled by not defining the TBB_TIMING Macro.

(It shows FLOP/s for different values of n.)

enter image description here

Answered By: Mysticial ( 626)

Upon further analysis of this, I believe this is (at least partially) caused by data alignment of the four pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.

EDIT : It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.

Here's the test code:

int main(){
    const int n = 100000;

    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        for(int j=0;j<n;j++){
            c1[j] += d1[j];


    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    return 0;

Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ONE_LOOP
seconds = 6.206

//#define ONE_LOOP
seconds = 2.116

#define ONE_LOOP
seconds = 1.894

//#define ONE_LOOP
seconds = 1.993


  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.

As @Stephen Cannon points out in the comments, there is very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:

5 Regions - Explanations

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data sizes increases, the amount of relative overhead goes down and the performance "saturates". Here two loops is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)

Region 3:

At this point, the data no longer fits in L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

The performance drop in the single-loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in cache. So you're bound by memory bandwidth.

2 x Intel X5482 Harpertown @ 3.2 GHz Intel Core i7 870 @ 2.8 GHz Intel Core i7 2600K @ 4.4 GHz

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?

Answered By: Lambdageek ( 1069)

Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -ffast-math option of gcc.


I bumped into this strange macro code in /usr/include/linux/kernel.h:

/* Force a compilation error if condition is true, but also produce a
   result (of value 0 and type size_t), so the expression can be used
   e.g. in a structure initializer (or where-ever else comma expressions
   aren't permitted). */
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int:-!!(e); }))
#define BUILD_BUG_ON_NULL(e) ((void *)sizeof(struct { int:-!!(e); }))

What does :-!! do?

Answered By: John Feminella ( 648)

This is, in effect, a way to check whether the expression e can be evaluated to be 0, and if not, to fail the build.

The macro is somewhat misnamed; it should be something more like BUILD_BUG_OR_ZERO, rather than ...ON_ZERO. (There have been occasional discussions about whether this is a confusing name.)

You should read the expression like this:

sizeof(struct { int: -!!(e); }))
  1. (e): Declare an expression e.

  2. !!(e): Negate it twice. This produces 0 if e was 0 originally, or a nonzero positive number if it wasn't.

  3. -!!(e): Multiply the value by -1. This results in 0 if step 2 was 0, or a negative number if it wasn't.

  4. struct{int: -!!(0);} --> struct{int: 0;}: If it was zero, then we declare a struct with an integer bitfield that has width zero. Everything is fine and we proceed as normal.

  5. struct{int: -!!(1);} --> struct{int: -1;}: On the other hand, if it isn't zero, then it will be some negative number. Declaring a bitfield with negative width is a compilation error.

So we'll either wind up with a bitfield that has width 0 in a struct, which is fine, or a bitfield with negative width, which is a compilation error. Then we take sizeof that field, so we get a size_t with the appropriate width (which will be zero in the case where e is zero).

Some people have asked: Why not just use an assert?

keithmo's answer here has a good response:

These macros implement a compile-time test, while assert() is a run-time test.

Exactly right. You don't want to detect problems in your kernel at runtime that could have been caught earlier! It's a critical piece of the operating system. To whatever extent problems can be detected at compile time, so much the better.


I don't think I fundamentally understand what an enum is, and when to use it.

For example:

typedef enum {
} ShapeType;

What is really being declared here?

Answered By: Adam Rosenfield ( 888)

Three things are being declared here: an anonymous enumerated type is declared, ShapeType is being declared a typedef for that anonymous enumeration, and the three names kCircle, kRectangle, and kOblateSpheroid are being declared as integral constants.

Let's break that down. In the simplest case, an enumeration can be declared as

enum tagname { ... };

This declares an enumeration with the tag tagname. In C and Objective-C (but not C++), any references to this must be preceded with the enum keyword. For example:

enum tagname x;  // declare x of type 'enum tagname'
tagname x;  // ERROR in C/Objective-C, OK in C++

In order to avoid having to use the enum keyword everywhere, a typedef can be created:

enum tagname { ... };
typedef enum tagname tagname;  // declare 'tagname' as a typedef for 'enum tagname'

This can be simplified into one line:

typedef enum tagname { ... } tagname;  // declare both 'enum tagname' and 'tagname'

And finally, if we don't need to be able to use enum tagname with the enum keyword, we can make the enum anonymous and only declare it with the typedef name:

typedef enum { ... } tagname;

Now, in this case, we're declaring ShapeType to be a typedef'ed name of an anonymous enumeration. ShapeType is really just an integral type, and should only be used to declare variables which hold one of the values listed in the declaration (that is, one of kCircle, kRectangle, and kOblateSpheroid). You can assign a ShapeType variable another value by casting, though, so you have to be careful when reading enum values.

Finally, kCircle, kRectangle, and kOblateSpheroid are declared as integral constants in the global namespace. Since no specific values were specified, they get assigned to consecutive integers starting with 0, so kCircle is 0, kRectangle is 1, and kOblateSpheroid is 2.

As Joel points out in Stack Overflow podcast #34, in C Programming Language (aka: K & R), there is mention of this property of arrays in C: a[5] == 5[a]

Joel says that it's because of pointer arithmetic but I still don't understand. Why does a[5] == 5[a] ?

Edit: The accepted answer is great. For a lower level view of how this works, see the comments section on that answer. There's a phenomenal conversation there about it. (This edit written about the comments available at the time. ie: the first ~16)

Answered By: Mehrdad Afshari ( 582)

Because a[5] will evaluate to:

*(a + 5)

and 5[a] will evaluate to:

*(5 + a)

and from elementary school math we know those are equal.

This is the direct artifact of arrays behaving as pointers, "a" is a memory address. "a[5]" is the value that's 5 elements further from "a". The address of this element is "a + 5". This is equal to offset "a" from "5" elements at the beginning of the address space (5 + a).

c, math

This question has been asked in an Oracle interview.

How would you divide a number by 3 without using *, /, +, -, %, operators?

The number may be signed or unsigned.

Answered By: Coodey ( 391)

There is a simple function I found here. But it's using the + operator, so you have to add the values with the bit-operators:

// replaces the + operator
int add(int x, int y) {
    int a, b;
    do {
        a = x & y;
        b = x ^ y;
        x = a << 1;
        y = b;
    } while (a);
    return b;

int divideby3 (int num) {
    int sum = 0;
    while (num > 3) {
        sum = add(num >> 2, sum);
        num = add(num >> 2, num & 3);
    if (num == 3)
        sum = add(sum, 1);
    return sum; 

As Jim commented this works because:

  • n = 4 * a + b
  • n / 3 = a + (a + b) / 3
  • So sum += a, n = a + b, and iterate
  • When a == 0 (n < 4), sum += floor(n / 3); i.e. 1, if n == 3, else 0

How to set, clear and toggle a bit in C?

Answered By: Jeremy Ruten ( 713)

Setting a bit

Use the bitwise OR operator (|) to set a bit.

number |= 1 << x;

That will set bit x.

Clearing a bit

Use the bitwise AND operator (&) to clear a bit.

number &= ~(1 << x);

That will clear bit x. You must invert the bit string with the bitwise NOT operator (~), then AND it.

Toggling a bit

The XOR operator (^) can be used to toggle a bit.

number ^= 1 << x;

That will toggle bit x.

Checking a bit

You didn't ask for this but I might as well add it.

To check a bit, AND it with the bit you want to check:

bit = number & (1 << x);

That will put the value of bit x into the variable bit.

Mike Willekes

Optimizing SQLite is tricky. Bulk-insert performance of a C application can vary from 85 inserts-per-second to over 96 000 inserts-per-second!

Background: We are using SQLite as part of a desktop application. We have large amounts of configuration data stored in XML files that are parsed and loaded into an SQLite database for further processing when the application is initialized. SQLite is ideal for this situation because it's fast, it requires no specialized configuration and the database is stored on disk as a single file.

Rationale: Initially I was disappointed with the performance I was seeing. It turns-out that the performance of SQLite can vary significnatly (both for bulk-inserts and selects) depending on how the database is configured and how you're using the API. It was not a trivial matter to figure-out what all of the options and techniques were, so I though it prudent to create this community wiki entry to share the results with SO readers in order to save others the trouble of the same investigations.

The Experiment: Rather than simply talking about performance tips in the general sense (i.e. "Use a transaction!"), I thought it best to write some C code and actually measure the impact of various options. We're going to start with some simple data:

  • A 28 meg TAB-delimited text file (approx 865000 records) of the complete transit schedule for the city of Toronto
  • My test machine is a 3.60 GHz P4 running Windows XP.
  • The code is compiled with MSVC 2005 as "Release" with "Full Optimization" (/Ox) and Favor Fast Code (/Ot).
  • I'm using the SQLite "Amalgamation", compiled directly into my test application. The SQLite version I happen to have is a bit older (3.6.7), but I suspect these results will be comparable to the latest release (please leave a comment if you think otherwise).

Lets write some code!

The Code: A simple C program that reads the text file line-by-line, splits the string into values and then will inserts the data into an SQLite database. In this "baseline" version of the code, the database is created but we won't actually insert data:

    Baseline code to experiment with SQLite performance.

    Input data is a 28 Mb TAB-delimited text file of the
    complete Toronto Transit System schedule/route info 

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include "sqlite3.h"

#define INPUTDATA "C:\\TTC_schedule_scheduleitem_10-27-2009.txt"
#define DATABASE "c:\\TTC_schedule_scheduleitem_10-27-2009.sqlite"
#define BUFFER_SIZE 256

int main(int argc, char **argv) {

    sqlite3 * db;
    sqlite3_stmt * stmt;
    char * sErrMsg = 0;
    char * tail = 0;
    int nRetCode;
    int n = 0;

    clock_t cStartClock;

    FILE * pFile;
    char sInputBuf [BUFFER_SIZE] = "\0";

    char * sRT = 0;  /* Route */
    char * sBR = 0;  /* Branch */
    char * sVR = 0;  /* Version */
    char * sST = 0;  /* Stop Number */
    char * sVI = 0;  /* Vehicle */
    char * sDT = 0;  /* Date */
    char * sTM = 0;  /* Time */

    char sSQL [BUFFER_SIZE] = "\0";

    /* Open the Database and create the Schema */
    sqlite3_open(DATABASE, &db);
    sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);

    /* Open input file and import into Database*/
    cStartClock = clock();

    pFile = fopen (INPUTDATA,"r");
    while (!feof(pFile)) {

        fgets (sInputBuf, BUFFER_SIZE, pFile);

        sRT = strtok (sInputBuf, "\t");     /* Get Route */
        sBR = strtok (NULL, "\t");          /* Get Branch */    
        sVR = strtok (NULL, "\t");          /* Get Version */
        sST = strtok (NULL, "\t");          /* Get Stop Number */
        sVI = strtok (NULL, "\t");          /* Get Vehicle */
        sDT = strtok (NULL, "\t");          /* Get Date */
        sTM = strtok (NULL, "\t");          /* Get Time */



    fclose (pFile);

    printf("Imported %d records in %4.2f seconds\n", n, (clock() - cStartClock) / (double)CLOCKS_PER_SEC);

    return 0;

The "Control"

Running the code as-is doesn't actually perform any database operations, but it will give us an idea of how fast the raw C file IO and string processing operations are.

Imported 864913 records in 0.94 seconds

Great! We can do 920 000 inserts-per-second, provided we don't actually do any inserts :-)

The "Worst-Case-Scenario"

We're going to generate the SQL string using the values read from the file and invoke that SQL operation using sqlite3_exec:

sprintf(sSQL, "INSERT INTO TTC VALUES (NULL, '%s', '%s', '%s', '%s', '%s', '%s', '%s')", sRT, sBR, sVR, sST, sVI, sDT, sTM);
sqlite3_exec(db, sSQL, NULL, NULL, &sErrMsg);

This is going to be slow because the SQL will be compiled into VDBE code for every insert and every insert will happen in it's own transaction. How slow?

Imported 864913 records in 9933.61 seconds

Yikes! 1 hour and 45 minutes! That's only 85 inserts-per-second.

Using a Transaction

By default SQLite will evaluate every INSERT / UPDATE statement within a unique transaction. If performing a large number of inserts, it's advisable to wrap your operation in a transaction:

sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);

pFile = fopen (INPUTDATA,"r");
while (!feof(pFile)) {


fclose (pFile);

sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);

Imported 864913 records in 38.03 seconds

That's better. Simply wrapping all of our inserts in a single transaction improved our performance to 23 000 inserts-per-second.

Using a Prepared Statement

Using a transaction was a huge improvement, but recompiling the SQL statement for every insert doesn't make sense if we using the same SQL over-and-over. Let's use sqlite3_prepare_v2 to compile our SQL statement once and then bind our parameters to that statement using sqlite3_bind_text:

/* Open input file and import into Database*/
cStartClock = clock();

sprintf(sSQL, "INSERT INTO TTC VALUES (NULL, @RT, @BR, @VR, @ST, @VI, @DT, @TM)");
sqlite3_prepare_v2(db,  sSQL, BUFFER_SIZE, &stmt, &tail);

sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);

pFile = fopen (INPUTDATA,"r");
while (!feof(pFile)) {

    fgets (sInputBuf, BUFFER_SIZE, pFile);

    sRT = strtok (sInputBuf, "\t");     /* Get Route */
    sBR = strtok (NULL, "\t");      /* Get Branch */    
    sVR = strtok (NULL, "\t");      /* Get Version */
    sST = strtok (NULL, "\t");      /* Get Stop Number */
    sVI = strtok (NULL, "\t");      /* Get Vehicle */
    sDT = strtok (NULL, "\t");      /* Get Date */
    sTM = strtok (NULL, "\t");      /* Get Time */

    sqlite3_bind_text(stmt, 1, sRT, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, sBR, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 3, sVR, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 4, sST, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 5, sVI, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 6, sDT, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 7, sTM, -1, SQLITE_TRANSIENT);




fclose (pFile);

sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);

printf("Imported %d records in %4.2f seconds\n", n, (clock() - cStartClock) / (double)CLOCKS_PER_SEC);


return 0;

Imported 864913 records in 16.27 seconds

Nice! There's a little bit more code (don't forget to call sqlite3_clear_bindings and sqlite3_reset) but we've more than doubled our performance to 53 000 inserts-per-second.

PRAGMA synchronous = OFF

By default SQLite will pause after issuing a OS-level write command. This guarantees that the data is written to the disk. By setting synchronous = OFF, we are instructing SQLite to simply hand-off the data to the OS for writing and then continue. There's a chance that the database file may become corrupted if the computer suffers a catastrophic crash (or power failure) before the data is written to the platter:

/* Open the Database and create the Schema */
sqlite3_open(DATABASE, &db);
sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA synchronous = OFF", NULL, NULL, &sErrMsg);

Imported 864913 records in 12.41 seconds

The improvements are now smaller, but we're up to 69 600 inserts-per-second.

PRAGMA journal_mode = MEMORY

Consider storing the rollback journal in memory by evaluating PRAGMA journal_mode = MEMORY. Your transaction will be faster, but if you lose power or your program crashes during a transaction you database could be left in a corrupt state with a partially-completed transaction:

/* Open the Database and create the Schema */
sqlite3_open(DATABASE, &db);
sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA journal_mode = MEMORY", NULL, NULL, &sErrMsg);

Imported 864913 records in 13.50 seconds

A little slower than the previous optimization at 64 000 inserts-per-second.

PRAGMA synchronous = OFF and PRAGMA journal_mode = MEMORY

Let's combine the previous two optimizations. It's a little more risky (in case of a crash), but we're just importing data (not running a bank):

/* Open the Database and create the Schema */
sqlite3_open(DATABASE, &db);
sqlite3_exec(db, TABLE, NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA synchronous = OFF", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "PRAGMA journal_mode = MEMORY", NULL, NULL, &sErrMsg);

Imported 864913 records in 12.00 seconds

Fantastic! We're able to do 72 000 inserts-per-second.

Using an In-Memory Database

Just for kicks, let's build upon all of the previous optimizations and redefine the database filename so we're working entirely in RAM:

#define DATABASE ":memory:"

Imported 864913 records in 10.94 seconds

It's not super-practical to store our database in RAM, but it's impressive that we can perform 79 000 inserts-per-second.

Refactoring C Code

Although not specifically an SQLite improvement, I don't like the extra char* assignment operations in the while loop. Let's quickly refactor that code to pass the output of strtok() directly into sqlite3_bind_text() and let the compiler try to speed things up for us:

pFile = fopen (INPUTDATA,"r");
while (!feof(pFile)) {

    fgets (sInputBuf, BUFFER_SIZE, pFile);

    sqlite3_bind_text(stmt, 1, strtok (sInputBuf, "\t"), -1, SQLITE_TRANSIENT); /* Get Route */
    sqlite3_bind_text(stmt, 2, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);  /* Get Branch */
    sqlite3_bind_text(stmt, 3, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);  /* Get Version */
    sqlite3_bind_text(stmt, 4, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);  /* Get Stop Number */
    sqlite3_bind_text(stmt, 5, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);  /* Get Vehicle */
    sqlite3_bind_text(stmt, 6, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);  /* Get Date */
    sqlite3_bind_text(stmt, 7, strtok (NULL, "\t"), -1, SQLITE_TRANSIENT);  /* Get Time */

    sqlite3_step(stmt);     /* Execute the SQL Statement */
    sqlite3_clear_bindings(stmt);   /* Clear bindings */
    sqlite3_reset(stmt);        /* Reset VDBE */

fclose (pFile);

Note: We are back to using a real database file. In-memory databases as fast but not necessarily practical

Imported 864913 records in 8.94 seconds

A slight refactoring to the string processing code used in our parameter binding has allowed us to perform 96 700 inserts-per-second. I think it's safe to say that this is plenty fast. As we start to tweak other variables (i.e. page size, index creation, etc.) this will be our benchmark.

Summary (so far)

I hope you're still with me! The reason we started down this road is that bulk-insert performance varies so wildly with SQLite and it's not always obvious what changes need to be made to speed-up our operation. Using the same compiler (and compiler options), the same version of SQLite and the same data we've optimized our code and our usage of SQLite to go from a worst-case scenario of 85 inserts-per-second to over 96 000 inserts-per-second!


Before we start measuring SELECT performance, we know that we'll be creating indexes. It's been suggested in one of the answers below that when doing bulk inserts, it is faster to create the index after the data has been inserted (as opposed to creating the index first then inserting the data). Let's try:

Create Index then Insert Data

sqlite3_exec(db, "CREATE  INDEX 'TTC_Stop_Index' ON 'TTC' ('Stop')", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);

Imported 864913 records in 18.13 seconds

Insert Data then Create Index

sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "CREATE  INDEX 'TTC_Stop_Index' ON 'TTC' ('Stop')", NULL, NULL, &sErrMsg);

Imported 864913 records in 13.66 seconds

As expected, bulk-inserts are slower if one column is indexed, but it does make a difference if the index is created after the data is inserted. Our no-index baseline is 96 000 insert-per-second. Creating the index first then inserting data gives us 47 700 inserts-per-second, whereas inserting the data first then creating the index gives us 63 300 inserts-per-second.

I'd gladly take suggestions for other scenarios to try... And will be compiling similar data for selects soon.

Answered By: Snazzer ( 84)

Several tips:

  1. Put inserts/updates in a transaction
  2. Consider a less paranoid journal mode (pragma journal_mode). There is NORMAL, and then there OFF which can significantly increase insert speed if you're not too worried about the database possibly getting corrupted if the OS crashes. If your application crashes the data should be fine.
  3. Playing with page sizes makes a difference as well (PRAGMA page_size). Having larger page sizes can make reads and writes go a bit faster as larger pages are held in memory. Note that more memory will be used for your database.
  4. If you have indices, consider calling CREATE INDEX after doing all your inserts. This is significantly faster than creating the index and then doing your inserts.
  5. You have to be quite careful if you have concurrent access to SQLite, as the whole database is locked when writes are done, and although multiple readers are possible, writes will be locked out.
  6. Take advantage of saving space...smaller databases go faster. For instance, if you have key value pairs, try making the key an INTEGER PRIMARY KEY if possible, which will replace the implied unique row number column in the table.
  7. If you are using multiple threads, you can try using the shared page cache, which will allow loaded pages to be shared between threads, which can avoid expensive I/O calls.

I've also asked similar questions here and here.

How does this C program work?


It compiles as it is (tested on gcc 4.6.3). It prints the time when compiled. On my system:

    !!  !!!!!!              !!  !!!!!!              !!  !!!!!! 
    !!  !!  !!              !!      !!              !!  !!  !! 
    !!  !!  !!              !!      !!              !!  !!  !! 
    !!  !!!!!!    !!        !!      !!    !!        !!  !!!!!! 
    !!      !!              !!      !!              !!  !!  !! 
    !!      !!              !!      !!              !!  !!  !! 
    !!  !!!!!!              !!      !!              !!  !!!!!!

Source: sykes2 - A clock in one line, sykes2 author hints

Some hints: No compile warnings per default. Compiled with -Wall, the following warnings are emitted:

sykes2.c:1:1: warning: return type defaults to ‘int’ [-Wreturn-type]
sykes2.c: In function ‘main’:
sykes2.c:1:14: warning: value computed is not used [-Wunused-value]
sykes2.c:1:1: warning: implicit declaration of function ‘putchar’ [-Wimplicit-function-declaration]
sykes2.c:1:1: warning: suggest parentheses around arithmetic in operand of ‘|’ [-Wparentheses]
sykes2.c:1:1: warning: suggest parentheses around arithmetic in operand of ‘|’ [-Wparentheses]
sykes2.c:1:1: warning: control reaches end of non-void function [-Wreturn-type]
Answered By: nneonneo ( 847)

Let's de-obfuscate it.


main(_) {
    _^448 && main(-~_);
        ? 32 | -~7[__TIME__-_/8%8][">'txiZ^(~z?"-48] >> ";;;====~$::199"[_*2&8|_/64]/(_&2?1:8)%8&1
        : 10);

Introducing variables to untangle this mess:

main(int i) {
    if(--i % 64) {
        char a = -~7[__TIME__-i/8%8][">'txiZ^(~z?"-48];
        char b = a >> ";;;====~$::199"[i*2&8|i/64]/(i&2?1:8)%8;
        putchar(32 | (b & 1));
    } else {
        putchar(10); // newline

Note that -~i == i+1 because of twos-complement. Therefore, we have

main(int i) {
    if(i != 448)
    if(i % 64 == 0) {
    } else {
        char a = -~7[__TIME__-i/8%8][">'txiZ^(~z?"-48];
        char b = a >> ";;;====~$::199"[i*2&8|i/64]/(i&2?1:8)%8;
        putchar(32 | (b & 1));

Now, note that a[b] is the same as b[a], and apply the -~ == 1+ change again:

main(int i) {
    if(i != 448)
    if(i % 64 == 0) {
    } else {
        char a = (">'txiZ^(~z?"-48)[(__TIME__-i/8%8)[7]] + 1;
        char b = a >> ";;;====~$::199"[(i*2&8)|i/64]/(i&2?1:8)%8;
        putchar(32 | (b & 1));

Converting the recursion to a loop and sneaking in a bit more simplification:

// please don't pass any command-line arguments
main() {
    int i;
    for(i=447; i>=0; i--) {
        if(i % 64 == 0) {
        } else {
            char t = __TIME__[7 - i/8%8];
            char a = ">'txiZ^(~z?"[t - 48] + 1;
            int shift = ";;;====~$::199"[(i*2&8) | (i/64)];
            if((i & 2) == 0)
                shift /= 8;
            shift = shift % 8;
            char b = a >> shift;
            putchar(32 | (b & 1));

This outputs one character per iteration. Every 64th character, it outputs a newline. Otherwise, it uses a pair of data tables to figure out what to output, and puts either character 32 (a space) or character 33 (a !). The first table (">'txiZ^(~z?") is a set of 10 bitmaps describing the appearance of each character, and the second table (";;;====~$::199") selects the appropriate bit to display from the bitmap.

The second table

Let's start by examining the second table, int shift = ";;;====~$::199"[(i*2&8) | (i/64)];. i/64 is the line number (6 to 0) and i*2&8 is 8 iff i is 4, 5, 6 or 7 mod 8.

if((i & 2) == 0) shift /= 8; shift = shift % 8 selects either the high octal digit (for i%8 = 0,1,4,5) or the low octal digit (for i%8 = 2,3,6,7) of the table value. The shift table ends up looking like this:

row col val
6   6-7 0
6   4-5 0
6   2-3 5
6   0-1 7
5   6-7 1
5   4-5 7
5   2-3 5
5   0-1 7
4   6-7 1
4   4-5 7
4   2-3 5
4   0-1 7
3   6-7 1
3   4-5 6
3   2-3 5
3   0-1 7
2   6-7 2
2   4-5 7
2   2-3 3
2   0-1 7
1   6-7 2
1   4-5 7
1   2-3 3
1   0-1 7
0   6-7 4
0   4-5 4
0   2-3 3
0   0-1 7

or in tabular form


Note that the author used the null terminator for the first two table entries (sneaky!).

This is designed after a seven-segment display, with 7s as blanks. So, the entries in the first table must define the segments that get lit up.

The first table

__TIME__ is a special macro defined by the preprocessor. It expands to a string constant containing the time at which the preprocessor was run, in the form "HH:MM:SS". Observe that it contains exactly 8 characters. Note that 0-9 have ASCII values 48 through 57 and : has ASCII value 58. The output is 64 characters per line, so that leaves 8 characters per character of __TIME__.

7 - i/8%8 is thus the index of __TIME__ that is presently being output (the 7- is needed because we are iterating i downwards). So, t is the character of __TIME__ being output.

a ends up equalling the following in binary, depending on the input t:

0 00111111
1 00101000
2 01110101
3 01111001
4 01101010
5 01011011
6 01011111
7 00101001
8 01111111
9 01111011
: 01000000

Each number is a bitmap describing the segments that are lit up in our seven-segment display. Since the characters are all 7-bit ASCII, the high bit is always cleared. Thus, 7 in the segment table always prints as a blank. The second table looks like this with the 7s as blanks:

11  55  
11  55  
22  33  
22  33  

So, for example, 4 is 01101010 (bits 1, 3, 5, and 6 set), which prints as


To show we really understand the code, let's adjust the output a bit with this table:

11  55
11  55
22  33
22  33

This is encoded as "?;;?==? '::799\x07". For artistic purposes, we'll add 64 to a few of the characters (since only the low 6 bits are used, this won't affect the output); this gives "?{{?}}?gg::799G" (note that the 8th character is unused, so we can actually make it whatever we want). Putting our new table in the original code:


we get

          !!              !!                              !!   
    !!  !!              !!  !!  !!  !!              !!  !!  !! 
    !!  !!              !!  !!  !!  !!              !!  !!  !! 
          !!      !!              !!      !!                   
    !!  !!  !!          !!  !!      !!              !!  !!  !! 
    !!  !!  !!          !!  !!      !!              !!  !!  !! 
          !!              !!                              !!   

just as we expected. It's not as solid-looking as the original, which explains why the author chose to use the table he did.