Top HTTP Questions

1760
Michiel de Mare

Form based authentication for websites

We believe that Stack Overflow should not just be a resource for very specific technical questions, but also for general guidelines on how to solve variations on common problems. "Form based authentication for websites" should be a fine topic for such an experiment.

It should include topics such as:

  • How to log in
  • How to remain logged in
  • How to store passwords
  • Using secret questions
  • Forgotten password functionality
  • OpenID
  • "Remember me" checkbox
  • Browser autocompletion of usernames and passwords
  • Secret URLs (public URLs protected by digest)
  • Checking password strength
  • E-mail validation
  • and much more ...

It should not include things like:

  • roles and authorization
  • HTTP basic authentication

Please help us by

  1. Suggesting subtopics
  2. Submitting good articles about this subject
  3. Editing the official answer

Answered By: Jens Roland ( 1215)

PART I: How To Log In

  1. As a rule, CAPTCHAs should be a last resort. They tend to be annoying, often aren't human-solvable, most of them are ineffective against bots, all of them are ineffective against cheap third-world labor (according to OWASP, the current sweatshop rate is $12 per 500 tests), and some implementations are technically illegal in some countries (see link number 1 from the MUST-READ list). If you must use a CAPTCHA, use reCAPTCHA, since it is OCR-hard by definition (since it uses already OCR-misclassified book scans).

  2. It is possible to prevent browsers from storing/retrieving a password with the autocomplete attribute on forms/input fields. However, in the real world your customers will have many accounts on different systems; it compromises their security if they use the same password for every site. Can you expect them to remember a different password for every site? There are some good password managers out there; however, there are also bad ones, which will become a target for attackers.

  3. The only (currently practical) way to protect against login interception (packet sniffing) during login is by using a certificate-based encryption scheme (for example, SSL) or a proven & tested challenge-response scheme (for example, the Diffie-Hellman-based SRP). Any other method can be easily circumvented by an eavesdropping attacker. On that note: hashing the password client-side (for example, with JavaScript) is useless unless it is combined with one of the above - that is, either securing the line with strong encryption or using a tried-and-tested challenge-response mechanism (if you don't know what that is, just know that it is one of the most difficult to prove, most difficult to design, and most difficult to implement concepts in digital security). Hashing the password is effective against password disclosure, but not against replay attacks, Man-In-The-Middle attacks / hijackings, or brute-force attacks (since we are handing the attacker the username, salt, and hashed password).

  4. After sending the authentication tokens, the system needs a way to remember that you have been authenticated - this fact should only ever be stored serverside in the session data. A cookie can be used to reference the session data. Wherever possible, the cookie should have the Secure and HttpOnly flags set when sent to the browser. The HttpOnly flag provides some protection against the cookie being read by an XSS attack. The Secure flag ensures that the cookie is only sent back via HTTPS, and therefore protects against network sniffing attacks. The value of the cookie should not be predictable. Where a cookie referencing a non-existent session is presented, its value should be replaced immediately to prevent session fixation.
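
To make the cookie advice concrete, here is a minimal sketch of issuing such a session cookie, assuming Java 8+ and the Servlet 3.0+ API; the cookie name and token size are illustrative:

import java.security.SecureRandom;
import java.util.Base64;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class SessionCookies {
    private static final SecureRandom RANDOM = new SecureRandom();

    public static void issueSessionCookie(HttpServletResponse response) {
        // Unpredictable value, per the advice above.
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        String sessionId = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);

        Cookie cookie = new Cookie("SESSIONID", sessionId);
        cookie.setHttpOnly(true); // not readable from JavaScript (XSS mitigation)
        cookie.setSecure(true);   // only ever sent back over HTTPS
        cookie.setPath("/");
        response.addCookie(cookie);
    }
}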

PART II: How To Remain Logged In - The Infamous "Remember Me" Checkbox

Persistent Login Cookies ("remember me" functionality) are a danger zone; on the one hand, they are entirely as safe as conventional logins when users understand how to handle them; and on the other hand, they are an enormous security risk in the hands of most users, who use them on public computers, forget to log out, don't know what cookies are or how to delete them, etc.

Personally, I want my persistent logins for the web sites I visit on a regular basis, but I know how to handle them safely. If you are positive that your users know the same, you can use persistent logins with a clean conscience. If not - well, then you're more like me; subscribing to the philosophy that users who are careless with their login credentials brought it upon themselves if they get hacked. It's not like we go to our users' houses and tear off all those facepalm-inducing Post-It notes with passwords they have lined up on the edge of their monitors, either. If people are idiots, then let them eat idiot cake.

Of course, some systems can't afford to have any accounts hacked; for such systems, there is no way you can justify having persistent logins.

If you DO decide to implement persistent login cookies, this is how you do it:

  1. First, follow Charles Miller's 'Best Practices' article. Do not get tempted to follow the 'Improved' Best Practices linked at the end of his article. Sadly, the 'improvements' to the scheme are easily thwarted (all an attacker has to do when stealing the 'improved' cookie is remember to delete the old one. This will require the legitimate user to re-login, creating a new series identifier and leaving the stolen one valid).

  2. And DO NOT STORE THE PERSISTENT LOGIN COOKIE (TOKEN) IN YOUR DATABASE, ONLY A HASH OF IT! The login token is Password Equivalent, so if an attacker got his hands on your database, he/she could use the tokens to log in to any account, just as if they were cleartext login-password combinations. Therefore, use strong salted hashing (bcrypt / phpass) when storing persistent login tokens.
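
To illustrate "store only a hash of the token", here is a dependency-free Java sketch using SHA-256 from the standard library. Note the text above recommends bcrypt/phpass; treat the choice of hash function here as a simplifying assumption for a random, high-entropy token:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;

public class PersistentLoginTokens {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** The token that goes into the cookie - never into the database. */
    public static String newToken() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** The hash of the token - the only thing you store server-side. */
    public static String hashForStorage(String token) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(token.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(hash);
    }
}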

PART III: Using Secret Questions

Don't. Never ever use 'secret questions'. Read the paper from link number 5 from the MUST-READ list. You can ask Sarah Palin about that one, after her Yahoo! email account got hacked during the presidential campaign because the answer to her 'security' question was... (wait for it) ... "Wasilla High School"!

Even with user-specified questions, it is highly likely that most users will choose either:

  • A 'standard' secret question like mother's maiden name or favourite pet

  • A simple piece of trivia that anyone could lift from their blog, LinkedIn profile, or similar

  • Any question that is easier to answer than guessing their password. Which, for any decent password, is every question conceivable.

In conclusion, security questions are inherently insecure in all their forms and variations, and should never be employed in an authentication scheme for any reason.

A secondary question is often considered as adequate for fulfilling a requirement for two-factor authentication. While capturing some of the response via clicks rather than typing in theory provides protection against keylogger attacks, it is still just an extension to the password mechanism - and when users are presented with a text box instead of drop-downs on a phishing site, they rarely perceive this as abnormal. Note that you may be able to fulfill your two-factor obligations by using a long-lasting cookie (granted on submission of multiple authentication questions) in place of a security question asked each and every time - but at the expense of user convenience.

The only reason anyone still uses security questions by choice is that it saves the cost of a few support calls from users who can't remember their email passwords to get to their reactivation codes. At the expense of security and Sarah Palin's reputation, that is. Worth it? You be the judge.

PART IV: Forgotten Password Functionality

I already mentioned why you should never use security questions for handling forgotten/lost user passwords. There are at least two more all-too-common pitfalls to avoid in this field:

  1. Don't RESET users' passwords no matter what - a 'reset' password is harder for the user to remember, which means he/she MUST either change it OR write it down - say, on a bright yellow Post-It on the edge of the monitor. Instead, just let users pick a new one right away - which is what they want to do anyway.

  2. Always hash the lost password code/token in the database. AGAIN, this code is another example of a Password Equivalent, so it MUST be hashed in case an attacker got his hands on your database. When a lost password code is requested, send the plaintext code to the user's email address, then hash it, save the hash in your database -- and throw away the original. Just like a password or a persistent login token.

A final note: always make sure your interface for entering the 'lost password code' is at least as secure as your login form itself, or an attacker will simply use this to gain access instead. Making sure you generate very long 'lost password codes' (for example, 16 case-sensitive alphanumeric characters) is a good start, but consider adding the same throttling that you do for logins.
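
As an illustration, a minimal Java sketch of generating such a code; the 16-character case-sensitive alphanumeric format comes from the paragraph above, everything else is an assumption. Remember from point 2: mail the plaintext code, store only its hash.

import java.security.SecureRandom;

public class LostPasswordCodes {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    /** 16 case-sensitive alphanumeric characters: log2(62^16) is about 95 bits. */
    public static String newCode() {
        StringBuilder code = new StringBuilder(16);
        for (int i = 0; i < 16; i++) {
            code.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return code.toString();
    }
}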

PART V: Checking Password Strength

First, you'll want to read this small article for a reality check: The 500 most common passwords

Okay, so maybe the list isn't the canonical list of most common passwords on any system anywhere ever, but it's a good indication of how poorly people will choose their passwords when there is no enforced policy in place. Plus, the list looks frighteningly close to home when you compare it to the publicly available analyses of 40,000+ recently stolen MySpace passwords.

Well, enough MySpace-bashing for now. Moving on...

So: With no minimum password strength requirements, 2% of users use one of the top 20 most common passwords. Meaning: if an attacker gets just 20 attempts, 1 in 50 accounts on your website will be crackable.

Thwarting this requires calculating the entropy of a password and then applying a threshold. The National Institute of Standards and Technology (NIST) Special Publication 800-63 has a set of very good suggestions. That, when combined with a dictionary and keyboard layout analysis (for example, 'qwertyuiop' is a bad password), can reject 99% of all poorly selected passwords at a level of 18 bits of entropy. Simply calculating password strength and showing a visual strength meter to a user is insufficient. Unless it is enforced, users will ignore it.
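
For a feel of what an entropy threshold looks like in code, here is a deliberately crude Java sketch. It only computes the charset-size upper bound; a real NIST 800-63 style estimator also discounts dictionary words and keyboard patterns such as 'qwertyuiop', which this sketch does not attempt:

public class PasswordEntropy {
    /** Rough upper bound: length * log2(size of the character pool used). */
    public static double estimateBits(String password) {
        int pool = 0;
        if (password.matches(".*[a-z].*")) pool += 26;
        if (password.matches(".*[A-Z].*")) pool += 26;
        if (password.matches(".*[0-9].*")) pool += 10;
        if (password.matches(".*[^a-zA-Z0-9].*")) pool += 33; // printable symbols
        return pool == 0 ? 0 : password.length() * (Math.log(pool) / Math.log(2));
    }

    /** Enforce the threshold - a strength meter alone will be ignored, as noted above. */
    public static boolean strongEnough(String password) {
        return estimateBits(password) >= 18;
    }
}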

PART VI: Much More - Or: Preventing Rapid-Fire Login Attempts

First, have a look at the numbers: Password Recovery Speeds - How long will your password stand up

If you don't have the time to look through the tables in that link, here's the gist of them:

  1. It takes virtually no time to crack a weak password, even if you're cracking it with an abacus

  2. It takes virtually no time to crack an alphanumeric 9-character password, if it is case insensitive

  3. It takes virtually no time to crack an intricate, symbols-and-letters-and-numbers, upper-and-lowercase password, if it is less than 8 characters long (a desktop PC can search the entire keyspace up to 7 characters in a matter of days or even hours)

  4. It would, however, take an inordinate amount of time to crack even a 6-character password, if you were limited to one attempt per second!

So what can we learn from these numbers? Well, lots, but we can focus on the most important part: the fact that preventing large numbers of rapid-fire successive login attempts (i.e. the brute-force attack) really isn't that difficult. But doing it right isn't as easy as it seems.

Generally speaking, you have three choices that are all effective against brute-force attacks (and dictionary attacks, but since you are already employing a strong password policy, they shouldn't be an issue):

  • Present a CAPTCHA after N failed attempts (annoying as hell and often ineffective -- but I'm repeating myself here)

  • Lock the account and require email verification after N failed attempts (this is a DoS attack waiting to happen)

  • And finally, login throttling: that is, setting a time delay between attempts after N failed attempts (yes, DoS attacks are still possible, but at least they are far less likely and a lot more complicated to pull off).

Best practice #1: A short time delay that increases with the number of failed attempts, like:

  • 1 failed attempt = no delay
  • 2 failed attempts = 2 sec delay
  • 3 failed attempts = 4 sec delay
  • 4 failed attempts = 8 sec delay
  • 5 failed attempts = 16 sec delay
  • etc.

DoS attacking this scheme would be very impractical, but on the other hand, potentially devastating, since the delay increases exponentially. A DoS attack lasting a few days could suspend the user for weeks.

To clarify: The delay is not a delay before returning the response to the browser. It is more like a timeout or refractory period during which login attempts to a specific account or from a specific IP address will not be accepted or evaluated at all. That is, correct credentials will not return a successful login, and incorrect credentials will not trigger a delay increase.
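
To illustrate both the schedule and the refractory-period semantics, here is a minimal in-memory Java sketch; a real system would track attempts per account and per IP address in persistent storage, and all names here are illustrative:

import java.util.HashMap;
import java.util.Map;

public class LoginThrottle {
    private static class State { int failures; long lockedUntilMillis; }
    private final Map<String, State> states = new HashMap<>();

    /** True if an attempt for this account may be evaluated at all right now. */
    public synchronized boolean attemptAllowed(String account) {
        State s = states.get(account);
        return s == null || System.currentTimeMillis() >= s.lockedUntilMillis;
    }

    /** Call after a failed attempt: no delay on the 1st, then 2s, 4s, 8s, ... */
    public synchronized void recordFailure(String account) {
        State s = states.computeIfAbsent(account, k -> new State());
        s.failures++;
        if (s.failures >= 2) {
            long delayMillis = 1000L << (s.failures - 1); // 2s, 4s, 8s, 16s, ...
            s.lockedUntilMillis = System.currentTimeMillis() + delayMillis;
        }
    }

    public synchronized void recordSuccess(String account) {
        states.remove(account);
    }
}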

Best practice #2: A medium length time delay that goes into effect after N failed attempts, like:

  • 1-4 failed attempts = no delay
  • 5 failed attempts = 15-30 min delay

DoS attacking this scheme would be quite impractical, but certainly doable. Also, it might be relevant to note that such a long delay can be very annoying for a legitimate user. Forgetful users will dislike you.

Best practice #3: Combining the two approaches - either a fixed, short time delay that goes into effect after N failed attempts, like:

  • 1-4 failed attempts = no delay
  • 5+ failed attempts = 20 sec delay

Or, an increasing delay with a fixed upper bound, like:

  • 1 failed attempt = 5 sec delay
  • 2 failed attempts = 15 sec delay
  • 3+ failed attempts = 45 sec delay

This final scheme was taken from the OWASP best-practices suggestions (link 1 from the MUST-READ list), and should be considered best practice, even if it is admittedly on the restrictive side.

As a rule of thumb however, I would say: the stronger your password policy is, the less you have to bug users with delays. If you require strong (case-sensitive alphanumerics + required numbers and symbols) 9+ character passwords, you could give the users 2-4 non-delayed password attempts before activating the throttling.

DoS attacking this final login throttling scheme would be very impractical. And as a final touch, always allow persistent (cookie) logins (and/or a CAPTCHA-verified login form) to pass through, so legitimate users won't even be delayed while the attack is in progress. That way, the very impractical DoS attack becomes an extremely impractical attack.

Additionally, it makes sense to do more aggressive throttling on admin accounts, since those are the most attractive entry points.

PART VII: Distributed Brute Force Attacks

Just as an aside, more advanced attackers will try to circumvent login throttling by 'spreading their activities':

  • Distributing the attempts on a botnet to prevent IP address flagging

  • Rather than picking one user and trying the 50,000 most common passwords (which they can't, because of our throttling), they will pick THE most common password and try it against 50,000 users instead. That way, not only do they get around maximum-attempts measures like CAPTCHAs and login throttling, their chance of success increases as well, since the number 1 most common password is far more likely than number 49,995

  • Spacing the login requests for each user account, say, 30 seconds apart, to sneak under the radar

Here, the best practice would be logging the number of failed logins, system-wide, and using a running average of your site's bad-login frequency as the basis for an upper limit that you then impose on all users.

Too abstract? Let me rephrase:

Say your site has had an average of 120 bad logins per day over the past 3 months. Using that (running average), your system might set the global limit to 3 times that -- i.e. 360 failed attempts over a 24 hour period. Then, if the total number of failed attempts across all accounts exceeds that number within one day (or even better, monitor the rate of acceleration and trigger on a calculated threshold), it activates system-wide login throttling - meaning short delays for ALL users (still, with the exception of cookie logins and/or backup CAPTCHA logins).
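
A minimal Java sketch of that global trigger; the 24-hour window and the 3x multiplier come from the example above, while the in-memory deque of timestamps is purely illustrative:

import java.util.ArrayDeque;
import java.util.Deque;

public class GlobalLoginThrottle {
    private final double dailyAverageFailures; // e.g. 120, from your own logs
    private final Deque<Long> recentFailures = new ArrayDeque<>();

    public GlobalLoginThrottle(double dailyAverageFailures) {
        this.dailyAverageFailures = dailyAverageFailures;
    }

    public synchronized void recordFailure() {
        recentFailures.addLast(System.currentTimeMillis());
    }

    /** True when failures in the last 24h exceed 3x the historical average. */
    public synchronized boolean throttleEveryone() {
        long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
        while (!recentFailures.isEmpty() && recentFailures.peekFirst() < cutoff) {
            recentFailures.removeFirst();
        }
        return recentFailures.size() > 3 * dailyAverageFailures;
    }
}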

I also posted a question with more details and a really good discussion of how to avoid tricky pitfalls in fending off distributed brute force attacks.

PART VIII: Two-Factor Authentication and Authentication Providers

Credentials can be compromised, whether by exploits, passwords being written down and lost, laptops with keys being stolen, or users entering logins into phishing sites. Logins can be protected with two-factor authentication, which uses out-of-band factors such as single-use codes received from a phone call, SMS message, or dongle. Several providers offer two-factor authentication services.

Authentication can be completely delegated to a single-sign-on service such as OAuth, OpenID or Persona (née BrowserID), where another provider handles collecting credentials. This pushes the problem to a trusted third party. Twitter is an example of an OAuth provider, while Facebook provides a similar proprietary solution.

MUST-READ LINKS About Web Authentication

965
Sander Versluys

What is the maximum length of a URL in different browsers?

Does it differ between browsers?

Does the HTTP protocol dictate it?

Answered By: Paul Dixon ( 1134)

Short answer - de facto limit of 2000 characters

If you keep URLs under 2000 characters, they'll work in virtually any combination of client and server software.

Longer answer - first, the standards...

RFC 2616 (Hypertext Transfer Protocol HTTP/1.1) section 3.2.1 says

The HTTP protocol does not place any a priori limit on the length of
a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

Note: Servers ought to be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations might not properly support these lengths.

...and the reality

That's what the standards say. For the reality, see this research over at boutell.com on what individual browser and server implementations will support. It's worth a read, but the executive summary is:

Extremely long URLs are usually a mistake. URLs over 2,000 characters will not work in the most popular web browser. Don't use them if you intend your site to work for the majority of Internet users.

Also, be aware that the sitemaps protocol, which allows a site to inform search engines about available pages, has a limit of 2048 characters in a URL. If you intend to use sitemaps, a limit has been decided for you! (see Calin-Andrei Burloiu's answer below)

There's also some research from 2010 into the maximum URL length that search engines will crawl and index. They found the limit was 2047 chars, which appears aligned with the sitemap protocol spec. However, they also found the Google SERP tool wouldn't cope with URLs longer than 1855 chars.

Footnote

This is a popular question, and as the original research is nearly 6 years old I'll try to keep it up to date: As of Jan 2013, the advice still stands, as IE8's maximum URL length is 2083 chars, and it seems IE9 has a similar limit.

760
alex

PUT vs. POST in REST

According to the HTTP/1.1 Spec:

The POST method is used to request that the origin server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line

In other words, POST is used to create.

The PUT method requests that the enclosed entity be stored under the supplied Request-URI. If the Request-URI refers to an already existing resource, the enclosed entity SHOULD be considered as a modified version of the one residing on the origin server. If the Request-URI does not point to an existing resource, and that URI is capable of being defined as a new resource by the requesting user agent, the origin server can create the resource with that URI.

That is, PUT is used to create or update.

So, which one should be used to create a resource? Or does one need to support both?

Answered By: Brian R. Bondy ( 638)

Overall:

Both PUT and POST can be used for creating.

You have to ask "what are you performing the action to?" to distinguish what you should be using. If you want to use POST then you would do that to a list of questions. If you want to use PUT then you would do that to a particular question.

Great, both can be used, so which one should I use in my RESTful design?

You do not need to support both PUT and POST.

Which is used is left up to you. But just remember to use the right one depending on what object you are referencing in the request.

Some considerations:

  • Do you name your URL objects you create explicitly, or let the server decide? If you name them then use PUT. If you let the server decide then use POST.
  • PUT is idempotent, so if you PUT an object twice, the second request has no additional effect. This is a nice property, so I would use PUT when possible.
  • You can update or create a resource with PUT using the same object URL.
  • With POST you can have 2 requests coming in at the same time making modifications to a URL, and they may update different parts of the object.

An example:

I wrote the following as part of another answer on SO regarding this:

POST:

Used to modify and update a resource

POST /questions/<existing_question> HTTP/1.1
Host: wahteverblahblah.com

Note that the following is an error:

POST /questions/<new_question> HTTP/1.1
Host: wahteverblahblah.com

If the URL is not yet created, you should not be using POST to create it while specifying the name. This should result in a 'resource not found' error because <new_question> does not exist yet. You should PUT the resource on the server first.

You could though do something like this to create a resources using POST:

POST /questions HTTP/1.1
Host: wahteverblahblah.com

Note that in this case the resource name is not specified; the new object's URL path would be returned to you.

PUT:

Used to create a resource, or overwrite it, while you specify the resource's new URL.

For a new resource:

PUT /questions/<new_question> HTTP/1.1
Host: wahteverblahblah.com

To overwrite an existing resource:

PUT /questions/<existing_question> HTTP/1.1
Host: wahteverblahblah.com

480
Sean McMains

What's the difference between a URI and a URL?

People talk about URLs and URIs as if they're different things, but they look the same to the naked eye. What's the difference between the two?

Answered By: Roger Pate ( 766)

URIs identify and URLs locate; however, locations are also identifications, so every URL is also a URI, but there are URIs which are not URLs.

Examples

  • Roger Pate

This is my name, which is identification. It is like a URI, but cannot be a URL, as it tells you nothing about my location or how to contact me. In this case it also happens to identify at least 5 other people in the USA alone.

  • 4914 West Bay Street, Nassau, Bahamas

This is a location, which is identification for that physical location. It is like both a URL and URI (since all URLs are URIs), and also identifies me indirectly as "resident of..". In this case it uniquely identifies me, but that would change if I get a roommate.

I say "like" because these examples do not follow the required syntax.

Popular confusion

From Wikipedia:

In computing, a Uniform Resource Locator (URL) is a subset of the Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it. In popular usage and in many technical documents and verbal discussions it is often incorrectly used as a synonym for URI, ... [emphasis mine]

Because of this common confusion, many products and documentation incorrectly use one term instead of the other, assign their own distinction, or use them synonymously.

URNs

My name, Roger Pate, could be like a URN, except those are much more regulated and intended to be unique across both space and time.

Because I currently share this name with other people, it's not globally unique and would not be appropriate as a URN. However, even if no other family used this name, I'm named after my paternal grandfather, so it still wouldn't be unique across time. And even if that wasn't the case, the possibility of naming my descendants after me makes this unsuitable as a URN.

URNs are different from URLs in this rigid uniqueness constraint, even though they both share the syntax of URIs.
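
Incidentally, Java's standard library makes for a compact illustration of this distinction: java.net.URI accepts any valid URI syntax, while java.net.URL insists on a scheme it can use for retrieval. A small sketch (the ISBN is an arbitrary example):

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UriVsUrl {
    public static void main(String[] args) throws URISyntaxException {
        // Every URL is also a URI, so this parses as both.
        URI locator = new URI("https://example.com/index.html");
        System.out.println("URI and URL: " + locator);

        // A URN is a URI but not a URL: it identifies without locating.
        URI name = new URI("urn:isbn:0451450523");
        try {
            new URL(name.toString()); // no retrieval mechanism for "urn:"
        } catch (MalformedURLException e) {
            System.out.println("URI but not URL: " + e.getMessage());
        }
    }
}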

How to use java.net.URLConnection to fire and handle HTTP requests?

Use of URLConnection is asked about pretty often here, and the Oracle tutorial is too concise about it. So how do I use it to fire and handle HTTP requests?

Are there other hints and best practices on this that may be useful?

Answered By: BalusC ( 826)

First a disclaimer beforehand: the posted code snippets are all basic examples. You'll need to handle trivial IOExceptions and RuntimeExceptions like NullPointerException, ArrayIndexOutOfBoundsException and the like yourself.


Preparing:

We first need to know at least the URL and the charset. The parameters are optional and depend on the functional requirements.

String url = "http://example.com";
String charset = "UTF-8";
String param1 = "value1";
String param2 = "value2";
// ...
String query = String.format("param1=%s&param2=%s", 
     URLEncoder.encode(param1, charset), 
     URLEncoder.encode(param2, charset));

The query parameters must be in name=value format and be concatenated by &. You would normally also URL-encode the query parameters with the specified charset using URLEncoder#encode().

The String#format() is just for convenience. I prefer it when I would need the String concatenation operator + more than twice.


Firing an HTTP GET request with (optional) query parameters:

It's a trivial task. It's the default request method.

URLConnection connection = new URL(url + "?" + query).openConnection();
connection.setRequestProperty("Accept-Charset", charset);
InputStream response = connection.getInputStream();
// ...

Any query string should be concatenated to the URL using ?. The Accept-Charset header may hint to the server what encoding the parameters are in. If you don't send any query string, then you can omit the Accept-Charset header. If you don't need to set any headers, then you can even use the URL#openStream() shortcut method.

InputStream response = new URL(url).openStream();
// ...

Either way, if the other side is an HttpServlet, then its doGet() method will be called and the parameters will be available by HttpServletRequest#getParameter().


Firing an HTTP POST request with query parameters:

Setting the URLConnection#setDoOutput() to true implicitly sets the request method to POST. The standard HTTP POST, as web forms perform it, is of type application/x-www-form-urlencoded, wherein the query string is written to the request body.

URLConnection connection = new URL(url).openConnection();
connection.setDoOutput(true); // Triggers POST.
connection.setRequestProperty("Accept-Charset", charset);
connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=" + charset);
OutputStream output = null;
try {
     output = connection.getOutputStream();
     output.write(query.getBytes(charset));
} finally {
     if (output != null) try { output.close(); } catch (IOException logOrIgnore) {}
}
InputStream response = connection.getInputStream();
// ...

Note: whenever you'd like to submit an HTML form programmatically, don't forget to take the name=value pairs of any <input type="hidden"> elements into the query string and of course also the name=value pair of the <input type="submit"> element which you'd like to "press" programmatically (because that's usually been used in the server side to distinguish if a button was pressed and if so, which one).

You can also cast the obtained URLConnection to HttpURLConnection and use its HttpURLConnection#setRequestMethod() instead. But if you're trying to use the connection for output you still need to set URLConnection#setDoOutput() to true.

HttpURLConnection httpConnection = (HttpURLConnection) new URL(url).openConnection();
httpConnection.setRequestMethod("POST");
// ...

Either way, if the other side is an HttpServlet, then its doPost() method will be called and the parameters will be available by HttpServletRequest#getParameter().


Actually firing the HTTP request:

You can fire the HTTP request explicitly with URLConnection#connect(), but the request will automatically be fired on demand when you want to get any information about the HTTP response, such as the response body using URLConnection#getInputStream() and so on. The above examples do exactly that, so the connect() call is in fact superfluous.


Gathering HTTP response information:

1) HTTP response status:

You need an HttpURLConnection here. Cast it first if necessary.

int status = httpConnection.getResponseCode();

2) HTTP response headers:

for (Entry<String, List<String>> header : connection.getHeaderFields().entrySet()) {
    System.out.println(header.getKey() + "=" + header.getValue());
}

3) HTTP response encoding:

When the Content-Type contains a charset parameter, then the response body is likely text based and we'd like to process the response body with the server-side specified character encoding.

String contentType = connection.getHeaderField("Content-Type");
String charset = null;
for (String param : contentType.replace(" ", "").split(";")) {
    if (param.startsWith("charset=")) {
        charset = param.split("=", 2)[1];
        break;
    }
}

if (charset != null) {
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(response, charset));
        for (String line; (line = reader.readLine()) != null;) {
            // ... System.out.println(line) ?
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }
} else {
    // It's likely binary content, use InputStream/OutputStream.
}

Maintaining the session:

The server side session is usually backed by a cookie. Some web forms require that you're logged in and/or are tracked by a session. You can use the CookieHandler API to maintain cookies. You need to prepare a CookieManager with a CookiePolicy of ACCEPT_ALL before sending all HTTP requests.

// First set the default cookie manager.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));

// All the following subsequent URLConnections will use the same cookie manager.
URLConnection connection = new URL(url).openConnection();
// ...

connection = new URL(url).openConnection();
// ...

connection = new URL(url).openConnection();
// ...

Note that this is known to not always work properly in all circumstances. If it fails for you, then it's best to manually gather and set the cookie headers. You basically need to grab all Set-Cookie headers from the response of the login or the first GET request and then pass them through the subsequent requests.

// Gather all cookies on the first request.
URLConnection connection = new URL(url).openConnection();
List<String> cookies = connection.getHeaderFields().get("Set-Cookie");
// ...

// Then use the same cookies on all subsequent requests.
connection = new URL(url).openConnection();
for (String cookie : cookies) {
    connection.addRequestProperty("Cookie", cookie.split(";", 2)[0]);
}
// ...

The split(";", 2)[0] is there to get rid of cookie attributes which are irrelevant for the server side like expires, path, etc. Alternatively, you could also use cookie.substring(0, cookie.indexOf(';')) instead of split().


Streaming mode

The HttpURLConnection will by default buffer the entire request body before actually sending it, regardless of whether you've set a fixed content length yourself using connection.setRequestProperty("Content-Length", contentLength);. This may cause OutOfMemoryErrors whenever you concurrently send large POST requests (e.g. uploading files). To avoid this, you should set the HttpURLConnection#setFixedLengthStreamingMode().

httpConnection.setFixedLengthStreamingMode(contentLength);

But if the content length is really not known beforehand, then you can make use of chunked streaming mode by setting the HttpURLConnection#setChunkedStreamingMode() accordingly. This will set the HTTP Transfer-Encoding header to chunked, which will force the request body to be sent in chunks. The below example will send the body in chunks of 1KB.

httpConnection.setChunkedStreamingMode(1024);

User-Agent

It can happen that a request returns an unexpected response, while it works fine with a real web browser. The server side is probably blocking requests based on the User-Agent request header. The URLConnection will by default set it to Java/1.6.0_19 where the last part is obviously the JRE version. You can override this as follows:

connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401"); // Do as if you're using Firefox 3.6.3.

Use the User-Agent string from a recent browser.


Error handling

If the HTTP response code is 4nn (Client Error) or 5nn (Server Error), then you may want to read the HttpURLConnection#getErrorStream() to see if the server has sent any useful error information.

InputStream error = ((HttpURLConnection) connection).getErrorStream();

If the HTTP response code is -1, then something went wrong with connection and response handling. The HttpURLConnection implementation is somewhat buggy with keeping connections alive. You may want to turn it off by setting the http.keepAlive system property to false. You can do this programmatically in the beginning of your application by:

System.setProperty("http.keepAlive", "false");

Uploading files:

You'd normally use multipart/form-data encoding for mixed POST content (binary and character data). The encoding is described in more detail in RFC 2388.

String param = "value";
File textFile = new File("/path/to/file.txt");
File binaryFile = new File("/path/to/file.bin");
String boundary = Long.toHexString(System.currentTimeMillis()); // Just generate some unique random value.
String CRLF = "\r\n"; // Line separator required by multipart/form-data.

URLConnection connection = new URL(url).openConnection();
connection.setDoOutput(true);
connection.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + boundary);
PrintWriter writer = null;
try {
    OutputStream output = connection.getOutputStream();
    writer = new PrintWriter(new OutputStreamWriter(output, charset), true); // true = autoFlush, important!

    // Send normal param.
    writer.append("--" + boundary).append(CRLF);
    writer.append("Content-Disposition: form-data; name=\"param\"").append(CRLF);
    writer.append("Content-Type: text/plain; charset=" + charset).append(CRLF);
    writer.append(CRLF);
    writer.append(param).append(CRLF).flush();

    // Send text file.
    writer.append("--" + boundary).append(CRLF);
    writer.append("Content-Disposition: form-data; name=\"textFile\"; filename=\"" + textFile.getName() + "\"").append(CRLF);
    writer.append("Content-Type: text/plain; charset=" + charset).append(CRLF);
    writer.append(CRLF).flush();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(new FileInputStream(textFile), charset));
        for (String line; (line = reader.readLine()) != null;) {
            writer.append(line).append(CRLF);
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }
    writer.flush();

    // Send binary file.
    writer.append("--" + boundary).append(CRLF);
    writer.append("Content-Disposition: form-data; name=\"binaryFile\"; filename=\"" + binaryFile.getName() + "\"").append(CRLF);
    writer.append("Content-Type: " + URLConnection.guessContentTypeFromName(binaryFile.getName())).append(CRLF);
    writer.append("Content-Transfer-Encoding: binary").append(CRLF);
    writer.append(CRLF).flush();
    InputStream input = null;
    try {
        input = new FileInputStream(binaryFile);
        byte[] buffer = new byte[1024];
        for (int length = 0; (length = input.read(buffer)) > 0;) {
            output.write(buffer, 0, length);
        }
        output.flush(); // Important! Output cannot be closed. Close of writer will close output as well.
    } finally {
        if (input != null) try { input.close(); } catch (IOException logOrIgnore) {}
    }
    writer.append(CRLF).flush(); // CRLF is important! It indicates end of binary boundary.

    // End of multipart/form-data.
    writer.append("--" + boundary + "--").append(CRLF);
} finally {
    if (writer != null) writer.close();
}

If the other side is an HttpServlet, then its doPost() method will be called and the parts will be available by HttpServletRequest#getPart() (note, thus not getParameter() and so on!). The getPart() method is, however, relatively new; it was introduced in Servlet 3.0 (Glassfish 3, Tomcat 7, etc). Prior to Servlet 3.0, your best choice is using Apache Commons FileUpload to parse a multipart/form-data request. Also see this answer for examples of both the FileUpload and the Servlet 3.0 approaches.


Last words:

The Apache HttpComponents HttpClient is much more convenient for all of this :)


Parsing and extracting HTML:

If all you want is parsing and extracting data from HTML, then you'd better use an HTML parser like Jsoup.

402
hasen j

What exactly is RESTful programming?

Don't give me links to wikipedia please, I'm hoping for a straightforward answer, not some BUZZ-word-ful answer.

Bonus question: Should I feel stupid because I never heard about it outside SO?

Answered By: Emil H ( 446)

Really, what it's about is using the true potential of HTTP. The protocol is oriented around verbs and resources. The two verbs in mainstream usage are GET and POST, which I think everyone will recognize. However, the HTTP standard defines several others such as PUT and DELETE. These verbs are then applied to resources.

For example, let's imagine that we have a user database that is managed by a web service. Using a non-RESTful approach, it would probably look something like this:

/user_create
/user?id=xxx
/user_edit?id=xxx
/user_delete?id=xxx

A RESTful application would probably expose an API with a single base resource.

/user
/user/xxx

If you want to create a new user you simply send a POST request to that URL with the data you want to create. If you want to retrieve a user you send a GET request to /user/xxx. If you want to update a user you simply send the fields you want to update using a PUT request to /user/xxx. If you want to delete it, you send a DELETE request instead.
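
For instance, deleting a user in this scheme is just a matter of applying the DELETE verb to that resource's URL. A minimal Java sketch using HttpURLConnection (the host is hypothetical):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestVerbs {
    /** DELETE /user/xxx - remove the resource identified by the URL. */
    public static int deleteUser(String id) throws IOException {
        URL url = new URL("http://example.com/user/" + id);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("DELETE");
        return connection.getResponseCode(); // e.g. 200 or 204 on success
    }
}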

An even more concrete example. In a RESTful application you'll never modify data using a GET request. This is what PUT, POST and DELETE are for. Most web applications do this all the time, though, and are therefore not RESTful.

There's an article called How I explained REST to my wife. As I recall, that's the first time I really got it. :)

You might also want to take a look at WebDAV which extends HTTP and adds even more verbs. Subversion uses this, for example.

Edit: REST is in reality much more complicated, and powerful, than what I've been able to pass on here. This is just a simple example of some of the principles as they apply to HTTP. Other answers on this page do a good job at explaining the concepts in more detail.

326
dbr

How do I implement basic "Long Polling"?

I can find lots of information on how Long Polling works (For example, this, and this), but no simple examples of how to implement this in code.

All I can find is cometd, which relies on the Dojo JS framework, and a fairly complex server system..

Basically, how would I use Apache to serve the requests, and how would I write a simple script (say, in PHP) which would "long-poll" the server for new messages?

The example doesn't have to be scalable, secure or complete, it just needs to work!

Answered By: dbr ( 266)

It's simpler than I initially thought.. Basically you have a page that does nothing, until the data you want to send is available (say, a new message arrives).

Here is a really basic example, which sends a simple string after 2-10 seconds, with a 1 in 3 chance of returning a 404 error (to show error handling in the coming Javascript example).

msgsrv.php

<?php
if(rand(1,3) == 1){
    /* Fake an error */
    header("HTTP/1.0 404 Not Found");
    die();
}

/* Send a string after a random number of seconds (2-10) */
sleep(rand(2,10));
echo("Hi! Have a random number: " . rand(1,10));
?>

Note: With a real site, running this on a regular web-server like Apache will quickly tie up all the "worker threads" and leave it unable to respond to other requests.. There are ways around this, but it is recommended to write a "long-poll server" in something like Python's Twisted, which does not rely on one thread per request. cometD is a popular one (which is available in several languages), and Tornado is a new framework made specifically for such tasks (it was built for FriendFeed's long-polling code)... but as a simple example, Apache is more than adequate! This script could easily be written in any language (I chose Apache/PHP as they are very common, and I happened to be running them locally)

Then, in Javascript, you request the above file (msgsrv.php), and wait for a response. When you get one, you act upon the data. Then you request the file and wait again, act upon the data (and repeat)

What follows is an example of such a page.. When the page is loaded, it sends the initial request for the msgsrv.php file.. If it succeeds, we append the message to the #messages div, then after 1 second we call the waitForMsg function again, which triggers the wait.

The 1 second setTimeout() is a really basic rate-limiter. It works fine without this, but if msgsrv.php always returns instantly (with a syntax error, for example) - you flood the browser and it can quickly freeze up. This would be better done by checking whether the file contains a valid JSON response, and/or keeping a running total of requests-per-minute/second, and pausing appropriately.

If the page errors, it appends the error to the #messages div, waits 15 seconds and then tries again (identical to how we wait 1 second after each message)

The nice thing about this approach is it is very resilient. If the client's internet connection dies, it will timeout, then try and reconnect - this is inherent in how long polling works, no complicated error-handling is required.

Anyway, the long_poller.htm code, using the jQuery framework:

<html>
<head>
    <title>BargePoller</title>
    <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js" type="text/javascript" charset="utf-8"></script>

    <style type="text/css" media="screen">
      body{ background:#000;color:#fff;font-size:.9em; }
      .msg{ background:#aaa;padding:.2em; border-bottom:1px #000 solid}
      .old{ background-color:#246499;}
      .new{ background-color:#3B9957;}
    .error{ background-color:#992E36;}
    </style>

    <script type="text/javascript" charset="utf-8">
    function addmsg(type, msg){
        /* Simple helper to add a div.
        type is the name of a CSS class (old/new/error).
        msg is the contents of the div */
        $("#messages").append(
            "<div class='msg "+ type +"'>"+ msg +"</div>"
        );
    }

    function waitForMsg(){
        /* This requests the url "msgsrv.php"
        When it completes (or errors) */
        $.ajax({
            type: "GET",
            url: "msgsrv.php",

            async: true, /* If set to non-async, browser shows page as "Loading.."*/
            cache: false,
            timeout:50000, /* Timeout in ms */

            success: function(data){ /* called when request to msgsrv.php completes */
                addmsg("new", data); /* Add response to a .msg div (with the "new" class)*/
                setTimeout(
                    waitForMsg, /* Request next message */
                    1000 /* ..after 1 second */
                );
            },
            error: function(XMLHttpRequest, textStatus, errorThrown){
                addmsg("error", textStatus + " (" + errorThrown + ")");
                setTimeout(
                    waitForMsg, /* Try again after.. */
                    15000); /* milliseconds (15 seconds) */
            }
        });
    };

    $(document).ready(function(){
        waitForMsg(); /* Start the initial request */
    });
    </script>
</head>
<body>
    <div id="messages">
        <div class="msg old">
            BargePoll message requester!
        </div>
    </div>
</body>
</html>

302
Joseph Holsten

JavaScript post request like a form submit

I'm trying to direct a browser to a different page. If I wanted a GET request, I might say

document.location.href = 'http://example.com/q=a';

But the resource I'm trying to access won't respond properly unless I use a POST request. If this were not dynamically generated, I might use the HTML

<form action="http://example.com/" method="POST">
  <input type="hidden" name="q" value="a">
</form>

Then I would just submit the form from the DOM.

But really I would like JavaScript code that allows me to say

post_to_url('http://example.com/', {'q':'a'});

What's the best cross browser implementation?

Edit

I'm sorry I was not clear. I need a solution that changes the location of the browser, just like submitting a form. If this is possible with XMLHttpRequest, it is not obvious. And this should not be asynchronous, nor use XML, so Ajax is not the answer.

Answered By: Rakesh Pai ( 461)

function post_to_url(path, params, method) {
    method = method || "post"; // Set method to post by default, if not specified.

    // The rest of this code assumes you are not using a library.
    // It can be made less wordy if you use one.
    var form = document.createElement("form");
    form.setAttribute("method", method);
    form.setAttribute("action", path);

    for(var key in params) {
        if(params.hasOwnProperty(key)) {
            var hiddenField = document.createElement("input");
            hiddenField.setAttribute("type", "hidden");
            hiddenField.setAttribute("name", key);
            hiddenField.setAttribute("value", params[key]);

            form.appendChild(hiddenField);
         }
    }

    document.body.appendChild(form);
    form.submit();
}

EDIT: Since this has gotten upvoted so much, I'm guessing people will be copy-pasting this a lot. So I added the hasOwnProperty check to fix any inadvertent bugs.

257
John Millikin

Are the PUT, DELETE, HEAD, etc. methods available in most web browsers?

I've seen a couple of questions around here like How to debug RESTful services, which mentions:

Unfortunately that same browser won't allow me to test HTTP PUT, DELETE, and to a certain degree even HTTP POST.

I've also heard that browsers support only GET and POST from some other sources.

However, a few quick tests in Firefox show that sending PUT and DELETE requests works as expected -- the XMLHttpRequest completes successfully, and the request shows up in the server logs with the right method. Is there some aspect to this I'm missing, such as cross-browser compatibility or non-obvious limitations?

Answered By: Matthew Murdoch ( 170)

HTML forms (up to HTML version 4 and XHTML 1) only support GET and POST as HTTP request methods. A workaround for this is to tunnel other methods through POST by using a hidden form field which is read by the server and the request dispatched accordingly.
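
That tunneling workaround can be sketched as a servlet filter in Java; the hidden field name "_method" is a common framework convention, and all class names here are illustrative:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletRequestWrapper;

public class MethodOverrideFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        final String override = request.getParameter("_method");
        if ("POST".equals(request.getMethod()) && override != null) {
            // Dispatch as if the form had used the tunneled method, e.g. PUT.
            request = new HttpServletRequestWrapper(request) {
                @Override
                public String getMethod() {
                    return override.toUpperCase();
                }
            };
        }
        chain.doFilter(request, res);
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}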

However, for the vast majority of RESTful web services GET, POST, PUT and DELETE should be sufficient. All these methods are supported by the implementations of XMLHttpRequest in all the major web browsers (IE, Firefox, Opera).

165
Morgan Cheng

What requests do browsers' F5 and Ctrl+F5 refreshes generate?

Is there a standard for what actions F5 and Ctrl+F5 trigger in web browsers?

I once experimented with IE6 and Firefox 2.x. An "F5" refresh would trigger an HTTP request sent to the server with an "If-Modified-Since" header, while "Ctrl+F5" would not have such a header. In my understanding, F5 will try to utilize cached content as much as possible, while "Ctrl+F5" is intended to abandon all cached content and just retrieve all content from the servers again.

But today, I noticed that in some of the latest browsers (Chrome, IE8) it doesn't work in this way anymore. Both "F5" and "Ctrl+F5" send the "If-Modified-Since" header.

So how is this supposed to work, or (if there is no standard) how do the major browsers differ in how they implement these refresh features?

Answered By: some ( 335)

It is up to the browser but they behave in similar ways.

I have tested FF, IE7, Opera and Chrome.

F5 usually updates the page only if it is modified. The browser usually tries to use all types of cache as much as possible and adds an "If-modified-since" header to the request. Opera differs by sending a "Cache-Control: no-cache".

CTRL-F5 is used to force an update, disregarding any cache. IE7 adds a "Cache-Control: no-cache", as does FF, which also adds "Pragma: no-cache". Chrome does a normal "If-modified-since" and Opera ignores the key.

If I remember correctly it was Netscape which was the first browser to add support for cache-control by adding "Pragma: No-cache" when you pressed CTRL-F5.

Edit: Updated table

The table below is updated with information on what will happen when the browser's refresh-button is clicked (after a request by Joel Coehoorn), and the "max-age=0" Cache-Control header.

Updated table, 27 September 2010

+------------+-----------------------------------------------+
|  UPDATED   |                Firefox 3.x                    |
|27 SEP 2010 |  +--------------------------------------------+
|            |  |             MSIE 8, 7                      |
| Version 3  |  |  +-----------------------------------------+
|            |  |  |          Chrome 6.0                     |
|            |  |  |  +--------------------------------------+
|            |  |  |  |       Chrome 1.0                     |
|            |  |  |  |  +-----------------------------------+
|            |  |  |  |  |    Opera 10, 9                    |
|            |  |  |  |  |  +--------------------------------+
|            |  |  |  |  |  |                                |
+------------+--+--+--+--+--+--------------------------------+
|          F5|IM|I |IM|IM|C |                                |
|    SHIFT-F5|- |- |CP|IM|- | Legend:                        |
|     CTRL-F5|CP|C |CP|IM|- | I = "If-Modified-Since"        |
|      ALT-F5|- |- |- |- |*2| P = "Pragma: No-cache"         |
|    ALTGR-F5|- |I |- |- |- | C = "Cache-Control: no-cache"  |
+------------+--+--+--+--+--+ M = "Cache-Control: max-age=0" |
|      CTRL-R|IM|I |IM|IM|C | - = ignored                    |
|CTRL-SHIFT-R|CP|- |CP|- |- |                                |
+------------+--+--+--+--+--+                                |
|       Click|IM|I |IM|IM|C | With 'click' I refer to a      |
| Shift-Click|CP|I |CP|IM|C | mouse click on the browsers    |
|  Ctrl-Click|*1|C |CP|IM|C | refresh-icon.                  |
|   Alt-Click|IM|I |IM|IM|C |                                |
| AltGr-Click|IM|I |- |IM|- |                                |
+------------+--+--+--+--+--+--------------------------------+

Versions tested:

  • Firefox 3.1.6 and 3.0.6 (WINXP)
  • MSIE 8.0.6001 and 7.0.5730.11 (WINXP)
  • Chrome 6.0.472.63 and 1.0.151.48 (WINXP)
  • Opera 10.62 and 9.61 (WINXP)

Notes:

  1. Version 3.0.6 sends I and C, but 3.1.6 opens the page in a new tab, making a normal request with only "I".

  2. Version 10.62 does nothing. 9.61 might do C unless it was a typo in my old table.

Note about Chrome 6.0.472: If you do a forced reload (like CTRL-F5) it behaves like the url is internally marked to always do a forced reload. The flag is cleared if you go to the address bar and press enter.