Top data-structures Questions

List of Tags

There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?

Everybody knows about linked lists, binary trees, and hashes, but what about Skip lists and Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.

PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.

EDIT: Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.

Answered By: David Phillips ( 273)

Tries, also known as prefix-trees or crit-bit trees, have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.

I've always been one to simply use List<String> names = new ArrayList<String>();
I use the interface as the type name for portability, so that when I ask questions such as these I can rework my code.

When should LinkedList be used over ArrayList and vice-versa?

Answered By: Jonathan Tran ( 495)

LinkedList and ArrayList are two different implementations of the List interface. LinkedList implements it with a doubly-linked list. ArrayList implements it with a dynamically resizing array.

As with standard linked list and array operations, the various methods will have different algorithmic runtimes.

For LinkedList

  • get is O(n)
  • add is O(1)
  • remove is O(n)
  • Iterator.remove is O(1)

For ArrayList

  • get is O(1)
  • add is O(1) amortized, but O(n) worst-case since the array must be resized and copied
  • remove is O(n)

LinkedList allows for constant-time insertions or removals, but only sequential access of elements. In other words, you can walk the list forwards or backwards, but grabbing an element in the middle takes time proportional to the size of the list.

ArrayLists, on the other hand, allow random access, so you can grab any element in constant time. But adding or removing from anywhere but the end requires shifting all the latter elements over, either to make an opening or fill the gap. Also, if you add more elements than the capacity of the underlying array, a new array (twice the size) is allocated, and the old array is copied to the new one, so adding to an ArrayList is O(n) in the worst case but constant on average.

So depending on the operations you intend to do, you should choose the implementations accordingly. Iterating over either kind of List is practically equally cheap. (Iterating over an ArrayList is technically faster, but unless you're doing something really performance-sensitive, you shouldn't worry about this -- they're both constants.)

Also, if you have large lists, keep in mind that memory usage is also different. Each element of a LinkedList has more overhead since pointers to the next and previous elements are also stored. ArrayLists don't have this overhead. However, ArrayLists take up as much memory as is allocated for the capacity, regardless of whether elements have actually been added.

The default initial capacity of an ArrayList is pretty small (10 from Java 1.4 - 6). But since the underlying implementation is an array, the array must be resized if you add a lot of elements. To avoid the high cost of resizing when you know you're going to add a lot of elements, construct the ArrayList with a higher initial capacity.

It's worth noting that Vector also implements the List interface and is almost identical to ArrayList. The difference is that Vector is synchronized, so it is thread-safe. Because of this, it is also slightly slower than ArrayList. So as far as I understand, most Java programmers avoid Vector in favor of ArrayList since they will probably synchronize explicitly anyway if they care about that.

337
Nakul Chaudhary

In most of programming languages, dictionaries are preferred over hashtables. What are the reasons behind that?

Answered By: Michael Madsen ( 462)

FWIW, a Dictionary is a hash table.

If you meant "why do we use the Dictionary class instead of the Hashtable class?", then it's an easy answer: Dictionary is a generic type, Hashtable is not. That means you get type safety with Dictionary, because you can't insert any random object into it, and you don't have to cast the values you take out.

Hi. I have a data structure that represents a directed graph, and I want to render that dynamically on an HTML page. Does anyone know of any javascript code that can do a reasonable job with graph layout? These graphs will usually be just a few nodes, maybe ten at the very upper end, so my guess is that performance isn't going to be a big deal. Ideally, I'd like to be able to hook it in with jQuery so that users can tweak the layout manually by dragging the nodes around.


Edit: Google's Visualization API seems to be more "graphs as charts" oriented than "graphs as nodes" oriented. I didn't see any node-oriented visualizations already built there, anyway. Do you know that one exists?

Answered By: Johann Philipp Strathausen ( 229)

I've just put together what you may be looking for: http://www.graphdracula.net

It's JavaScript with directed graph layouting, SVG and you can even drag the nodes around. Still needs some tweaking, but is totally usable. You create nodes and edges easily with JavaScript code like this:

var g = new Graph();
g.addEdge("strawberry", "cherry");
g.addEdge("cherry", "apple");
g.addEdge("id34", "cherry");

I used the previously mentioned Raphael JS library (the graffle example) plus some code for a force based graph layout algorithm I found on the net (everything open source, MIT license). If you have any remarks or need a certain feature, I may implement it, just ask!

Edit: You may want to have a look at other projects, too! Here's a list of similar projects (some have been already mentioned here):

I'm trying to answer two questions in a definitive list:

  1. What are the underlying data structures used for Redis?
  2. And what are the main advantages/disadvantages/use cases for each type?

So, I've read the Redis lists are actually implemented with linked lists. But for other types, I'm not able to dig up any information. Also, if someone were to stumble upon this question and not have a high level summary of the pros and cons of modifying or accessing different data structures, they'd have a complete list of when to best use specific types to reference as well.

Specifically, I'm looking to outline all types: string, list, set, zset and hash.

Oh, I've looked at these article, among others, so far:

Answered By: antirez ( 352)

I'll try to answer your question, but I'll start with something that may look strange at first: if you are not interested in Redis internals you should not care about how data types are implemented internally. This is for a simple reason: for every Redis operation you'll find the time complexity in the documentation and, if you have the set of operations and the time complexity, the only other thing you need is some clue about memory usage (and because we do many optimizations that may vary depending on data, the best way to get these latter figures are doing a few trivial real world tests).

But since you asked, here is the underlying implementation of every Redis data type.

  • Strings are implemented using a C dynamic string library so that we don't pay (asymptotically speaking) for allocations in append operations. This way we have O(N) appends, for instance, instead of having quadratic behavior.
  • Lists are implemented with linked lists.
  • Sets and Hashes are implemented with hash tables.
  • Sorted sets are implemented with skip lists (a peculiar type of balanced trees).

But when lists, sets, and sorted sets are small in number of items and size of the largest values, a different, much more compact encoding is used. This encoding differs for different types, but has the feature that it is a compact blob of data that often forces an O(N) scan for every operation. Since we use this format only for small objects this is not an issue; scanning a small O(N) blob is cache oblivious so practically speaking it is very fast, and when there are too many elements the encoding is automatically switched to the native encoding (linked list, hash, and so forth).

But your question was not really just about internals, your point was What type to use to accomplish what?.

Strings

This is the base type of all the types. It's one of the four types but is also the base type of the complex types, because a List is a list of strings, a Set is a set of strings, and so forth.

A Redis string is a good idea in all the obvious scenarios where you want to store an HTML page, but also when you want to avoid converting your already encoded data. So for instance, if you have JSON or MessagePack you may just store objects as strings. In Redis 2.6 you can even manipulate this kind of object server side using Lua scripts.

Another interesting usage of strings is bitmaps, and in general random access arrays of bytes, since Redis exports commands to access random ranges of bytes, or even single bits. For instance check this good blog post: Fast Easy real time metrics using Redis.

Lists

Lists are good when you are likely to touch only the extremes of the list: near tail, or near head. Lists are not very good to paginate stuff, because random access is slow, O(N). So good uses of lists are plain queues and stacks, or processing items in a loop using RPOPLPUSH with same source and destination to "rotate" a ring of items.

Lists are also good when we want just to create a capped collection of N items where usually we access just the top or bottom items, or when N is small.

Sets

Sets are an unordered data collection, so they are good every time you have a collection of items and it is very important to check for existence or size of the collection in a very fast way. Another cool thing about sets is support for peeking or popping random elements (SRANDMEMBER and SPOP commands).

Sets are also good to represent relations, e.g., "What are friends of user X?" and so forth. But other good data structures for this kind of stuff are sorted sets as we'll see.

Sets support complex operations like intersections, unions, and so forth, so this is a good data structure for using Redis in a "computational" manner, when you have data and you want to perform transformations on that data to obtain some output.

Small sets are encoded in a very efficient way.

Hashes

Hashes are the perfect data structure to represet objects, composed of fields and values. Fields of hashes can also be atomically incremented using HINCRBY. When you have objects such as users, blog posts, or some other kind of item, hashes are likely the way to go if you don't want to use your own encoding like JSON or similar.

However, keep in mind that small hashes are encoded very efficiently by Redis, and you can ask Redis to atomically GET, SET or increment individual fields in a very fast fashion.

Hashes can also be used to represent linked data structures, using references. For instance check the lamernews.com implementation of comments.

Sorted Sets

Sorted sets are the only other data structures, besides lists, to maintain ordered elements. You can do a number of cool stuff with sorted sets. For instance, you can have all kinds of Top Something lists in your web application. Top users by score, top posts by pageviews, top whatever, but a single Redis instance will support tons of insertion and get-top-elements operations per second.

Sorted sets, like regular sets, can be used to describe relations, but they also allow you to paginate the list of items and to remember the ordering. For instance, if I remember friends of user X with a sorted set I can easily remember them in order of accepted friendship.

Sorted sets are good for priority queues.

Sorted sets are like more powerful lists where inserting, removing, or getting ranges from the the middle of the list is always fast. But they use more memory, and are O(log(N)) data structures.

Conclusion

I hope that I provided some info in this post, but it is far better to download the source code of lamernews from http://github.com/antirez/lamernews and understand how it works. Many data structures from Redis are used inside Lamer News, and there are many clues about what to use to solve a given task.

Sorry for grammar typos, it's midnight here and too tired to review the post ;)