HTTP
Lecture # 2
HTTP Overview
CSE 135
HTTP Intro
HTTP (Hyper Text Transfer Protocol)
It's an application-layer protocol, similar to SMTP, POP, IMAP, NNTP, FTP, etc.
A simple protocol that defines the standard way that clients request data from Web servers and how these servers respond
Typically it runs on top of TCP/IP
Three versions have been used (0.9, 1.0, 1.1) and two are still commonly used
RFC 1945: HTTP 1.0 (1996)
RFC 2616: HTTP 1.1 (1999)
Transport Layer
The HTTP message data stream is chopped up into chunks small enough to fit in a TCP segment
The chunks ride inside TCP segments, whose sequence numbers are used to reassemble them correctly on the other end of the connection
HTTP Request
HTTP Response
HTTP Client
Asks for a resource by its URL: http://www.foo.com/page.html
Proxy
Transparent Proxies
Local DNS
HTTP Requests
HTTP requests and responses are both types of Internet Messages (RFC 822), and share a general format:
A Start Line, followed by a CRLF
Request Line for requests
Status Line for responses
Zero or more Headers, each followed by a CRLF
An empty line
Two CRLFs in a row mark the end of the Headers
An optional message body
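The general format above can be sketched in a few lines of Python. This is a minimal illustration, not a full HTTP implementation: the helper name `build_request` is hypothetical, and the host and path simply reuse the example URL from earlier in the lecture.

```python
CRLF = "\r\n"

def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request message as a string."""
    start_line = f"{method} {path} HTTP/1.1"  # the Request Line
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # The start line and every header end with a CRLF; one more CRLF
    # (the empty line) marks the end of the headers, then the body follows.
    return CRLF.join([start_line, *header_lines]) + CRLF + CRLF + body

message = build_request("GET", "/page.html", {"Host": "www.foo.com"})

# Splitting on the first empty line separates the headers from the body.
head, _, body = message.partition(CRLF + CRLF)
print(repr(message))
```

A response has the same shape; only the start line differs (a Status Line instead of a Request Line).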
Note: A Host header is required for HTTP 1.1 connections, though not for HTTP 1.0
HEAD
Retrieves only the Headers associated with a resource but not the entity itself
Highly useful for protocol analysis and diagnostics
POST
Allows passing of data in the entity rather than the URL
Can transmit far larger arguments than GET
Arguments are not displayed in the URL
TRACE
Diagnostic method for assessing the impact of proxies along the request-response chain
PUT, DELETE
Used in HTTP publishing (e.g., WebDAV)
CONNECT
A common extension method for tunneling other protocols through HTTP
Why do I care?
Well, if you are doing Web programming, you may have to form raw requests with headers yourself.
For example, in JavaScript using Ajax you will have to form raw HTTP requests using GET and POST (or even HEAD if you like) to transmit your data
Also, in HTML forms, when you set the method attribute <form method="GET|POST"> you are specifying the HTTP method used to transmit the data
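As a rough illustration of choosing the method yourself, the sketch below builds (but never sends) three requests with Python's standard `urllib.request.Request`. The `/search` path and the field names are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"Name": "Al Smith", "Age": "30"}  # hypothetical form fields
encoded = urlencode(fields)

# GET: the arguments ride in the URL's query string; there is no body.
get_req = Request("http://www.foo.com/search?" + encoded, method="GET")

# POST: the same arguments travel in the message body instead.
post_req = Request(
    "http://www.foo.com/search",
    data=encoded.encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# HEAD: like a GET, but only the headers would come back.
head_req = Request("http://www.foo.com/page.html", method="HEAD")

# Nothing is sent over the network here; we only inspect the objects.
for req in (get_req, post_req, head_req):
    print(req.get_method(), req.full_url, req.data)
```

Passing any of these objects to `urllib.request.urlopen` would actually send it, which is deliberately left out here.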
Observation: One-Way Requests and 204s
There are many details of HTTP that people don't consider but that are highly useful. One example is the 204 (No Content) response, which sends back no data
Observe Google using this in its search results page to send what I dub a "flare gun" request to see exactly what the user clicked on
The purpose of this is to improve search quality and to defeat those folks who reverse engineer the Google algorithm. The "human filter", if you like
HTTP Headers
Headers come in four major types, some for requests, some for responses, some for both:
General Headers
Provide info about messages of both kinds
Request Headers
Provide request-specific info
Response Headers
Provide response-specific info
Entity Headers
Provide info about request and response entities
The obvious solution to stale caches is to add cache-control headers (or to change resource names), but then again that defeats the value of caching
Better to know about caching and do it properly
Consider typical Web pages: what would you want to cache?
Referer
The URL of the resource from which the current request URI came
Misspelled in the specification
Referer: http://www.host.com/login.asp
Cookie
How clients pass cookies back to the servers that set them
Cookie: id=23432;level=3
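On the server side, the value of that header has to be split back into name/value pairs. A minimal sketch using Python's standard `http.cookies` module follows; the helper name `parse_cookie_header` is hypothetical.

```python
from http.cookies import SimpleCookie

def parse_cookie_header(value):
    """Return the name/value pairs carried in a Cookie request header."""
    jar = SimpleCookie()
    jar.load(value)  # parses "name=value" pairs separated by semicolons
    return {name: morsel.value for name, morsel in jar.items()}

cookies = parse_cookie_header("id=23432;level=3")
print(cookies)
```

Note that every value arrives as a string; it is up to the application to decide that `level` is a number.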
Using Request Headers: Browser Sniffing
User-Agent is often used in browser detection to serve a different type of page to a different type of accessing agent
Similarity problem
Everything looks like old Mozilla
A better approach is to take this and add an injected script or program that profiles the device. In the long run, as device diversity grows, the concept of a browser will evolve significantly
Since the Referer header is sent from the base page, a simple form of anti-leeching is to check for it before sending a dependent object
Of course, the bad guy now moves to forge the header
Class Question: can you think of other countermeasures?
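The Referer check can be sketched as a small Python function. The hostname set and the function name are assumptions for illustration, and, as noted, a forged header defeats this check.

```python
from urllib.parse import urlsplit

# Assumption: the hostnames of our own pages.
ALLOWED_HOSTS = {"www.host.com"}

def allow_dependent_object(request_headers):
    """Serve an image or script only if the Referer points at our own pages."""
    referer = request_headers.get("Referer", "")
    host = urlsplit(referer).hostname  # None when the header is missing
    return host in ALLOWED_HOSTS

print(allow_dependent_object({"Referer": "http://www.host.com/login.asp"}))
print(allow_dependent_object({"Referer": "http://leech.example/page.html"}))
print(allow_dependent_object({}))
```

Rejecting requests with no Referer at all is a policy choice: some browsers and proxies strip the header, so a stricter check also turns away some legitimate visitors.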
Using Request Headers: Content Negotiation
The user agent sends an Accept header indicating the types of content it can handle
Using Request Headers: Content Negotiation
A q-rating can indicate the preference the user agent has for the data requested
Content negotiation allows us to ask for something like "logo" and then get the appropriate image (PNG, JPG, etc.) based upon what the device can handle
This leads to extensionless URLs, which aids long-term maintainability
We'll see that file extensions don't really mean much
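A simplified sketch of server-side negotiation for the "logo" example follows. The function names and the variant table are hypothetical, and wildcards such as `image/*` are deliberately ignored to keep it short.

```python
def parse_accept(header):
    """Turn an Accept header into (media_type, q) pairs, most preferred first."""
    choices = []
    for item in header.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0  # q defaults to 1 when the client omits it
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        choices.append((media_type, q))
    return sorted(choices, key=lambda choice: choice[1], reverse=True)

def negotiate(header, available):
    """Pick the first variant we have, in the client's order of preference."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return available[media_type]
    return None

# Hypothetical variants stored behind the extensionless URL /logo
variants = {"image/png": "logo.png", "image/jpeg": "logo.jpg"}
chosen = negotiate("image/jpeg;q=0.8, image/png", variants)
print(chosen)
```

Here the client lists JPEG first but rates it lower (q=0.8) than PNG (default q=1), so the PNG variant is chosen.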
Compression Considerations
Increased origin server CPU or wasted cycles?
TTFB vs TTLB consideration and LANs
Decompress times
Nasty little bugs
http://support.microsoft.com/default.aspx?scid=kb;en-us;823386&Product=ie600
In Internet Explorer, the bytes that remain to be decoded in the buffer may be small (8 bytes or less) and the data contained in the buffer decompresses to 0 bytes. When Mshtml receives 0 bytes, it thinks that all the data is read and closes the data stream. As a result, the HTML page sometimes appears truncated. Typically, if it is for a referenced file such as a .js or a .css file type, the HTTP connection stops responding.
Response Headers
Server
The server's name and version
Server: Microsoft-IIS/5.0
Can be problematic for security reasons
Security by obscurity?
Vary
Tells client & proxy caches which headers were used for content negotiation
Vary: User-Agent, Accept
Content-Location
The actual URL of the resource if different from its request URL
Often used to show the index or default page
Content-Location: http://www.foo.com/home.html
Content-Type is the most important header to the browser. The data in this header tells the browser what it is receiving. Now it should make sense why file extensions don't really matter and are arbitrary.
Server: file extension -> MIME type
Browser: MIME type -> Action (display, download, etc.)
Note: Without HTTP, the browser relies on the file extension, for example when loading a file off the local disk.
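The server half of that mapping (file extension -> MIME type) can be sketched with Python's standard `mimetypes` module. The helper name and the fallback to `application/octet-stream` are choices made for this example, not something the lecture prescribes.

```python
import mimetypes

def content_type_for(filename):
    """Server side: map a file extension to the MIME type to send."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    return mime_type or "application/octet-stream"

for name in ("page.html", "logo.png", "data.unknownext"):
    print(f"{name} -> Content-Type: {content_type_for(name)}")
```

The browser then performs the second half of the mapping, deciding from the Content-Type value whether to display, download, or hand off the data.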
Why do I care?
Because sometimes you need to stamp outgoing data on the server side with the appropriate MIME type
Etag: adkskdashjgk07563AF
Why do I care?
Well, you could go beyond the basic Cache-Control and Pragma headers and use Expires and other forms of cache hints
Ultimately you may be forced to use a query string or an alternate file name to force misconfigured caches to stop causing you problems
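One common way to apply the query-string trick is to derive the extra parameter from the file's content, so the URL changes only when the file does. The sketch below assumes a parameter named `v` and a SHA-256 hash; both are illustrative choices.

```python
import hashlib

def versioned_url(url, content):
    """Append a short content hash so a changed file gets a new URL."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}v={digest}"

# Hypothetical stylesheet before and after an edit.
old = versioned_url("/css/site.css", b"body { color: black; }")
new = versioned_url("/css/site.css", b"body { color: navy; }")
print(old)
print(new)
```

Because caches key on the full URL, the edited stylesheet is fetched fresh while unchanged files keep being served from cache.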
Sending data via HTTP
Data can be sent to a server application in two primary ways:
1. Query string sent via a GET request
2. Data body sent via a POST request
In both cases the data is encoded in a special manner called x-www-form-urlencoded, which replaces spaces with + symbols, replaces special characters with the %hex value of the character being escaped, and separates the individual arguments with ampersand (&) characters.
Note: Data may also be sent via HTTP headers, mostly in the form of cookie-based data. Other HTTP headers such as User-Agent, Referer, etc. can be tapped as well, but these are generally not user supplied; they constitute the environment in which the Web transaction takes place.
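The x-www-form-urlencoded rules described above can be seen with Python's standard `urllib.parse`. The field names and values are invented for the example.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form fields; one value contains a space and an ampersand.
fields = {"Name": "Al Smith", "Company": "Smith & Sons", "Age": "30"}

# Spaces become "+", the "&" inside a value becomes "%26", and the
# arguments themselves are joined with "&".
encoded = urlencode(fields)
print(encoded)

# The server reverses the process; each name maps to a list of values
# because a name may legally appear more than once.
decoded = parse_qs(encoded)
print(decoded)
```

The same encoded string can be used either as a GET query string or as a POST body.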
However, GET query-string-based URLs are portable: you can bookmark them, send them to friends, etc.
Sending Data with GET Contd.
Now it should start to make sense what query strings mean and how they are formed
Sending Data with GET Contd.
Behind the scenes you see that indeed the data is transmitted in the request line itself, as part of the URL
The POST request sends the data in the message body, but it is x-www-form-urlencoded as well, so we might have a message body like
Name=Al+Smith&Age=30&Sex=male
No size limit, but browsers have to deal with the fact that a POST cannot simply be redone
"Repost form data?"
Sending Data with POST Contd.
The network trace shows the difference between POST and GET
Why do I care?
GET and POST have different uses
GET is used when the request is idempotent, meaning multiple requests return the same result
POST should be used when you change the state of the server
Lots of folks use GET for state changes because of ease of coding
Downsides: inadvertent state changes by spiders, browsers, etc.
Question: How can you keep track of information from one page to the next?
Answer:
Hidden form fields that are posted back to the server
E.g., Microsoft's VIEWSTATE value in .NET
Many programming environments go to significant lengths to provide easy state management (more on this later!)
versus
http://www.channelregister.co.uk/2010/07/29/cray_1_replica/
An irony is that the resulting scale model Cray-1 is probably more powerful than Cray's original, nearly 40-year-old design.
A physical server can of course serve many protocols (SMTP, FTP, etc.) or may be protocol specific
Web Servers are of course HTTP servers
There are no fixed answers to any of these questions
Planning should be guided by the goals of the deployment and should harmonize with the related business processes
Co-located Server
Pro: Admin control of entire box
Con: Must purchase box and manage remotely
Virtual Hosting
Pro: Cheapest and easiest to maintain solution
Con: Server is shared, admin access limited
IIS
Included in Windows server environment
Security black eye (or is it from the OS?)
Favored in business and intranets
IIS 6 solid, IIS 7 is VERY Apache-like
Many app servers (Tomcat, Zope, etc.) include Web servers (or Apache) as part of their distribution
Bandwidth sizing should be adjusted based on your actual request frequency and size
Assume peaks of triple the average load or more
Also watch out for collisions and overloading of routers, switches, hubs and NICs on the network
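The sizing rule above amounts to simple arithmetic, sketched here in Python. The traffic figures are invented for illustration, and the function name is hypothetical; the factor of three comes from the rule of thumb above.

```python
def required_bandwidth_mbps(requests_per_second, avg_response_bytes, peak_factor=3.0):
    """Estimate the link capacity (megabits per second) needed at peak load."""
    average_bits_per_second = requests_per_second * avg_response_bytes * 8
    return average_bits_per_second * peak_factor / 1_000_000

# Hypothetical site: 50 requests/s, averaging 40,000 bytes per response.
average = required_bandwidth_mbps(50, 40_000, peak_factor=1.0)
peak = required_bandwidth_mbps(50, 40_000)
print(f"average: {average:.0f} Mbps, peak: {peak:.0f} Mbps")
```

With these numbers the average load is 50 x 40,000 x 8 = 16,000,000 bits per second (16 Mbps), so the link should be sized for at least 48 Mbps. The estimate ignores protocol overhead (TCP/IP and HTTP headers), so real requirements will be somewhat higher.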
This is the public view of the site: the site as visitors will see it when they browse to it
Physical structure is the organization of the files and directories in the file system on the host machine's hard disk
This is the private view of the site, seen only by you and those users you choose to give access
It will become obvious why this distinction is necessary to keep things straight
Often given an index or default document that serves as the homepage of the site. Corresponds to the / at the end of the hostname portion of the URL:
http://www.foo.com/index.html (virtual)
Ex: /var/www/index.html (physical)
Ex: C:\inetpub\wwwroot\index.html (physical)
In fact, a URL is purely virtual: there is no guarantee that the path to the right of the document root looks this way on disk
Could http://www.foo.com/index2.html map to C:/foo/a/b/c/myfile.html?
Sure, you can do this with aliases, redirects, local OS mappings, all sorts of stuff
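The virtual-to-physical translation can be sketched as a lookup followed by a join against the document root. This is a simplified illustration, not how any particular server implements it: the alias table, the Unix-style paths, and the function name are assumptions.

```python
import posixpath

DOCUMENT_ROOT = "/var/www"
# Hypothetical alias table: a URL path mapped to a file elsewhere on disk.
ALIASES = {"/index2.html": "/foo/a/b/c/myfile.html"}

def physical_path(url_path):
    """Translate the virtual (URL) path into a physical file path."""
    if url_path in ALIASES:
        return ALIASES[url_path]
    if url_path.endswith("/"):
        url_path += "index.html"  # the default document
    # Normalize away ".." segments so a request cannot climb out of the root.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    return DOCUMENT_ROOT + clean

print(physical_path("/"))
print(physical_path("/index2.html"))
print(physical_path("/../../etc/passwd"))
```

The last call shows why the normalization step matters: the result stays under the document root instead of pointing at the real /etc/passwd.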
Virtual Hosting
We know the hostname part of the URL is a virtual locator for files that live (physically) in a site's document root
The idea of virtual hosting takes this a step further by allowing a single server to host many domains, each with its own document root
Two methods of virtual hosting
Old way: multiple IP addresses per server
New way: name-based, using Host headers
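Name-based virtual hosting boils down to a lookup keyed on the Host header. The sketch below is a minimal illustration; the second domain, the directory layout, and the default root are all made up.

```python
# Hypothetical table: one server, several domains, one document root each.
VIRTUAL_HOSTS = {
    "www.foo.com": "/var/www/foo",
    "www.bar.com": "/var/www/bar",
}
DEFAULT_ROOT = "/var/www/default"

def document_root_for(headers):
    """Choose a document root from the request's Host header."""
    host = headers.get("Host", "")
    hostname = host.split(":")[0].lower()  # drop an optional :port
    return VIRTUAL_HOSTS.get(hostname, DEFAULT_ROOT)

print(document_root_for({"Host": "www.foo.com"}))
print(document_root_for({"Host": "WWW.BAR.COM:8080"}))
print(document_root_for({}))
```

This also shows why the Host header is mandatory in HTTP 1.1: without it, a server sharing one IP address among many domains cannot tell which site was meant.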
These restrictions should be backed up by access control lists on the directories that enforce the principle of least access
If the Web site (or part of it) does not need to be available for anonymous access from everywhere, then users, groups, hosts and IPs should be restricted
HTTP Authentication can also be employed to make all or part of a site private and require login
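To see what the simplest form of HTTP Authentication (Basic) actually puts on the wire, here is a small Python sketch. The credentials and helper names are invented. Note that Base64 is an encoding, not encryption, so anyone who can read the request can recover the password.

```python
import base64

def basic_auth_header(username, password):
    """Build the header a client sends for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Authorization: Basic {token}"

def decode_basic(header_line):
    """Recover the credentials from the header; no key or secret is needed."""
    token = header_line.split(" ", 2)[2]
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password

header = basic_auth_header("user", "secret")  # hypothetical credentials
print(header)
print(decode_basic(header))
```

Since the credentials can be decoded this easily, Basic authentication only makes sense over an encrypted connection.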
If all or part of the site requires authentication and serious security for users' login credentials, form-based authentication over SSL is the only choice
Submit the request to the CA and pay up
Retrieve the certificate and install it
Test the certificate with an HTTPS request
Classic CGI: fork and exec
Server API: running inside the Web server's address space
Web application framework: running inside the Web server process but managing its own pool of resources via IPC
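For the classic CGI case, the server forks, execs a program, hands it the request through environment variables such as `QUERY_STRING`, and relays whatever the program writes to standard output. A minimal sketch of such a program in Python follows; the `Name` parameter and the greeting are invented for the example.

```python
import os
import sys
from urllib.parse import parse_qs

def handle(environ):
    """Build a CGI response from the environment the server passes in."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("Name", ["world"])[0]
    body = f"Hello, {name}!\n"
    # A CGI program emits headers, an empty line, then the entity body.
    return f"Content-Type: text/plain\r\n\r\n{body}"

if __name__ == "__main__":
    # When run by the Web server, the request data is in os.environ.
    sys.stdout.write(handle(os.environ))
```

Every request pays for a new process here, which is the cost the Server API and framework approaches are designed to avoid.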
Server API
Apache modules
ISAPI filters and extensions
Reading from disk is always slower than reading from memory, thus add tons of memory? Memcache?
A sliding scale of solutions
Use fast disk controllers (SCSI) or SSD (memory again!)
Exploit caching mechanisms to keep as much data as possible in memory
Add hardware! (and give it specialized roles)
Load Balancers
Campaign tracking
Top referring sites/domains/URLs
Time/event-based spikes or dips
Audience analysis
IP geography, language preference, client host type (.com, .edu, .org, etc.)
Network-tap-based systems also exist, which provide insight into delivery
Given the three sides of the Web equation, one wonders if this isn't again a question not of "versus" but of working together for a full view
Well-behaved bots will request this first, and obey its directives
# sample robots.txt file
User-Agent: *
Disallow: /newtoday
Disallow: /downloads

User-Agent: newsbot
Disallow: /pressreleases
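What "obeying its directives" looks like from the bot's side can be sketched with Python's standard `urllib.robotparser`. The rules are the sample file above, with the last path spelled `/pressreleases` on the assumption that the doubled "r" was a typo; the bot name `anybot` and the page paths are invented.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# sample robots.txt file
User-Agent: *
Disallow: /newtoday
Disallow: /downloads

User-Agent: newsbot
Disallow: /pressreleases
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network fetch

# A generic bot falls under the "*" record.
print(parser.can_fetch("anybot", "/newtoday/index.html"))
print(parser.can_fetch("anybot", "/about.html"))
# newsbot has its own record, which replaces the "*" rules for it.
print(parser.can_fetch("newsbot", "/pressreleases/2010.html"))
print(parser.can_fetch("newsbot", "/newtoday/index.html"))
```

Nothing enforces these answers: a badly behaved bot can simply skip the check, which is why robots.txt is not an access control mechanism.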
All monitors usually alert via email, pager, SMS
Thresholds can be set to allow for transient errors & delays, or warn of degrading performance
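One plausible way to express such thresholds is sketched below in Python: alert only after several consecutive failed checks, and warn when the average response time passes a limit. The function names and default values are assumptions, not settings from any particular monitoring product.

```python
def should_alert(results, max_consecutive_failures=3):
    """Alert only when the most recent checks have all failed.

    `results` is a list of booleans, oldest first (True = check passed).
    A single failed check is treated as a transient error.
    """
    recent = results[-max_consecutive_failures:]
    return len(recent) == max_consecutive_failures and not any(recent)

def is_degrading(response_times_ms, warn_threshold_ms=2000):
    """Warn when the average response time creeps past the threshold."""
    if not response_times_ms:
        return False
    average = sum(response_times_ms) / len(response_times_ms)
    return average > warn_threshold_ms

print(should_alert([True, False, True, False]))   # isolated failures
print(should_alert([True, False, False, False]))  # three in a row
print(is_degrading([2500, 3000, 1900]))
```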
Server Tuning
Many recommended optimizations are highly specific to Web server vendor/version
Some common elements
Disable reverse DNS lookups in logging
Shorten connection timeouts (trades some bandwidth for server resources)
Remove unneeded server API modules
Minimize other application overhead
Optimize process & thread pools and limits