Session 14 - Server-Side: HTTP and Apache Web Server Configuration

Harvard Extension School  
Spring 2021

Course Web Site: https://cscie12.dce.harvard.edu/

Topics

  1. Wrapping Up
  2. The Internet and the Web
  3. HyperText Transfer Protocol
  4. Apache HTTP Server
  5. Caching - Don't deliver content unnecessarily
  6. Minify and Compress Content
  7. Friendly Errors
  8. Friendly Ways to Get There
  9. Search Engines and Optimization
  10. Accessibility
  11. Oh, The Places You'll Go!

Presentation contains 48 slides

Wrapping Up

Some Observations

My Project Update

The Internet and the Web

Internet Routing

Domain Names: Top Level Domains (TLD)

TLDs are managed by the Internet Assigned Numbers Authority (IANA)

Generic: .com, .org, .edu, .gov, etc.

Country codes: .ch, .cn, .de, .uk, .us, etc.

Full listing of TLDs

Getting Your Own Domain and Hosting

  1. Domain Name
    • Buy the domain through a "registrar"
    • Provide name servers
    • About $10/yr
  2. Hosting
    • Shared ($7-15/mo)
    • Private / Cloud

A very short list of hosting companies as a place to start.

Web Server Software

Web Server Market Share

Netcraft Web Server Survey

HyperText Transfer Protocol

GET

United States National Archives
www.archives.gov

GET / HTTP/1.1
Host: www.archives.gov
User-Agent: curl/7.49.0
Accept: */*

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Date: Thu, 30 Jul 2020 23:25:09 GMT
Content-Language: en
Set-Cookie: UUID=7efbfc41-6054-bf24-f977-24eb8d075e4e; expires=Fri, 30-Jul-2021 23:06:47 GMT; Max-Age=31536000; path=/; domain=.archives.gov; httponly
Last-Modified: Thu, 30 Jul 2020 23:06:47 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
ETag: W/"1596150407-0-gzip"
v-ttl: 2497
Cache-Control: public, max-age=60, s-maxage=180
v-cache-ttl: 2497
X-Frame-Options: SAMEORIGIN
Accept-Ranges: bytes
Vary: Cookie,Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 6c46ad9c24627fa8c065620a1a7a52a9.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: EWR52-C1
X-Amz-Cf-Id: LqRBsWPmMMWNU4m66BY-LRfHuL1LI8Xrcd6unFAZ0VJJWWdO_I--uA==

<!DOCTYPE html>
  <!-- truncated for example -->
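To make the shape of the exchange above concrete, here is a minimal Python sketch that splits a raw HTTP/1.x response into its status line, headers, and body. The function name `parse_http_response` and the shortened `SAMPLE` response are illustrative assumptions, not part of any library; repeated headers (such as several Set-Cookie lines) would overwrite each other in this simplified version.

```python
def parse_http_response(raw: str):
    """Split a raw HTTP/1.x response into (version, status, reason, headers, body)."""
    # A blank line (CRLF CRLF) separates the header block from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The first line is the status line, e.g. "HTTP/1.1 200 OK".
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return version, int(status), reason, headers, body


# Shortened, hypothetical response modeled on the transcript above.
SAMPLE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Cache-Control: public, max-age=60, s-maxage=180\r\n"
    "\r\n"
    "<!DOCTYPE html>"
)

version, status, reason, headers, body = parse_http_response(SAMPLE)
print(status, reason, headers["content-type"])
```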

HTTP Overview

HTTP is Stateless

Each requested resource is a separate, independent request to the server; HTTP is a stateless protocol.

HTTP Versions

The Internet Engineering Task Force (IETF), historically together with the W3C, oversees the Hypertext Transfer Protocol.

An HTTP Conversation

HTTP 1.1 Methods

HTTP Response Codes

HTTP 1.1 status codes commonly seen

The complete list:
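As one way to explore the status codes, Python's standard library ships an enumeration of them in `http.HTTPStatus`. The sketch below looks up a few commonly seen codes and labels each with its class (the first digit); the helper name `describe` and the class labels are illustrative assumptions.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return e.g. '404 Not Found (client error)' for a numeric status code."""
    status = HTTPStatus(code)  # raises ValueError for an unknown code
    return f"{status.value} {status.phrase} ({CLASSES[code // 100]})"


for code in (200, 301, 302, 304, 404, 500):
    print(describe(code))
```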

Common Headers

Request (Browser)

Response (Server)

Looking at HTTP Under the Hood

Use your browser developer tools!

HTTP Header: Host

Problem: "Infinite" domain names; finite IP addresses.

Solution: "Virtual Hosts"

Example: all of the following names map to 140.247.197.241

Host Header

This is required for HTTP 1.1 requests.

HEAD /http/raspberry.gif HTTP/1.1
Host: cscie12.dce.harvard.edu

HTTP/1.1 200 OK
Date: Tue, 8 Apr 2020 20:23:14 GMT
Server: Apache/2.2 (Fedora)
Last-Modified: Wed, 06 Apr 2015 19:30:42 GMT
ETag: "461fb8-348c-a0f67c80"
Accept-Ranges: bytes
Content-Length: 13452
Connection: close
Content-Type: image/gif

Connection closed by foreign host.
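The idea behind virtual hosts can be sketched in a few lines of Python: many names share one IP address, and the server uses the Host header to decide which site to serve. The host names, document roots, and the `document_root` function are hypothetical and are not how Apache is implemented internally.

```python
# Hypothetical table of virtual hosts; all names and paths are made up.
VIRTUAL_HOSTS = {
    "www.example.org": "/var/www/example",
    "blog.example.org": "/var/www/blog",
}
DEFAULT_ROOT = "/var/www/default"


def document_root(request_headers: dict) -> str:
    """Choose a document root based on the Host request header."""
    host = request_headers.get("Host", "")
    # Host names are case-insensitive and may carry a ":port" suffix.
    hostname = host.split(":", 1)[0].lower()
    if not hostname:
        raise ValueError("HTTP/1.1 requests must include a Host header")
    # Unknown names fall back to a default site in this sketch.
    return VIRTUAL_HOSTS.get(hostname, DEFAULT_ROOT)


print(document_root({"Host": "blog.example.org"}))
```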

HTTP/2

What are the key differences to HTTP/1.x?

From the HTTP/2 FAQ:

At a high level, HTTP/2:

Apache HTTP Server

apache httpd

Apache Configuration Overview

Scope of .htaccess files

Directives within .htaccess files apply to the directory that contains the .htaccess file and all its descendants.

Directives within the file,
/home/courses/j/h/jharvard/public_html/.htaccess
would apply to all files within and "under" the public_html directory for the user jharvard.

Directives within the file,
/home/courses/j/h/jharvard/public_html/books/.htaccess
would apply to all files within and "under" the public_html/books directory for the user jharvard.
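The scoping rule can be illustrated with a small Python sketch that lists, from least to most specific, the .htaccess files that would apply to a requested file. This is a simplification: it only walks from the document root downward and ignores server settings such as AllowOverride. The function name `htaccess_chain` is an assumption made for illustration.

```python
from pathlib import PurePosixPath


def htaccess_chain(document_root: str, requested_file: str) -> list:
    """List the .htaccess files that apply to requested_file, outermost first."""
    root = PurePosixPath(document_root)
    target_dir = PurePosixPath(requested_file).parent
    # relative_to raises ValueError if the file is outside the document root.
    relative = target_dir.relative_to(root)
    chain = [root / ".htaccess"]
    current = root
    for part in relative.parts:
        current = current / part
        chain.append(current / ".htaccess")
    return [str(path) for path in chain]


ROOT = "/home/courses/j/h/jharvard/public_html"
for path in htaccess_chain(ROOT, ROOT + "/books/index.html"):
    print(path)
```

Directives in files later in the list are applied after, and can override, those earlier in the list.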

Problems You Will Have with .htaccess files

500 Internal Server Error

500 Internal Server Error

:(

Problems You will encounter when using .htaccess files (Internal Server Error 500)

500 Internal Server Error
If you begin seeing 500 Internal Server Error responses from the server after you have created or edited an .htaccess file, the most likely cause is incorrect file permissions or an error in the directive syntax.
cscie12students% pwd
/home/courses/j/h/jharvard/public_html
cscie12students% ls -l .htaccess
-rw-------   1 jharvard  founder         349 Nov 27 00:03 .htaccess
cscie12students% chmod o+r .htaccess
cscie12students% ls -l .htaccess
-rw----r--   1 jharvard  founder         349 Nov 27 00:03 .htaccess
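The permission fix above comes down to one bit: the web server usually runs as a different user, so the file must be readable by "others". A minimal Python sketch of that check, using the standard `stat` module (the helper name `is_world_readable` is an assumption):

```python
import stat


def is_world_readable(mode: int) -> bool:
    """True if the 'other' read bit is set in a file mode."""
    return bool(mode & stat.S_IROTH)


before = stat.S_IFREG | 0o600    # regular file, rw-------
after = before | stat.S_IROTH    # what `chmod o+r` adds

print(stat.filemode(before), is_world_readable(before))
print(stat.filemode(after), is_world_readable(after))
```

For a real file, the mode would come from `os.stat(".htaccess").st_mode`.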

Problems You will encounter when using .htaccess files (Can't see the .htaccess file)

You can't "see" your .htaccess file.

Apache Configuration Sections

Configuration directives can be limited by using "sections", such as

Within .htaccess

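Note that of the sections that match on directories, files, or URLs, only Files and FilesMatch can be used within .htaccess files; Directory and Location sections are not allowed there.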

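Examples:

# deny access to the .htaccess file itself
# (Apache 2.2 syntax; Apache 2.4 uses "Require all denied" instead)
<Files .htaccess>
    Order allow,deny
    Deny from all
</Files>

# deny access to any tilde backup files
<Files *~>
    Order allow,deny
    Deny from all
</Files>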
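To get a feel for which file names the two Files sections above would cover, the wildcard matching can be approximated with Python's `fnmatch` module. This is only an approximation of Apache's own matcher, and the helper name `is_blocked` is an assumption.

```python
from fnmatch import fnmatchcase

# The two patterns used in the Files sections above.
BLOCKED_PATTERNS = (".htaccess", "*~")


def is_blocked(filename: str) -> bool:
    """True if the bare file name matches one of the blocked patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in BLOCKED_PATTERNS)


for name in (".htaccess", "index.html~", "index.html"):
    print(name, is_blocked(name))
```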

Caching - Don't deliver content unnecessarily

Types of Caching

Caching Related Headers

Local cache and proxy-server cache.

Proxy Servers


If-Modified-Since

A request for the Apache Software Foundation logo (http://apache.org/img/asf_logo.png) that is part of loading http://apache.org/foundation/

Initial request:


GET /img/asf_logo.png HTTP/1.1
Host: apache.org
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Accept: image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36
Referer: http://apache.org/foundation/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,ro;q=0.6

HTTP/1.1 200 OK
Date: Tue, 14 Apr 2015 22:40:52 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Tue, 14 Apr 2015 16:08:47 GMT
ETag: "751e-513b1721525d0"
Accept-Ranges: bytes
Content-Length: 29982
Cache-Control: max-age=3600
Expires: Tue, 14 Apr 2015 23:40:52 GMT
Keep-Alive: timeout=30, max=98
Connection: Keep-Alive
Content-Type: image/png

After the cached copy expires, if it is still in the local cache, the browser will make a conditional request:

GET /img/asf_logo.png HTTP/1.1
Host: apache.org
Connection: keep-alive
Accept: image/webp,*/*;q=0.8
If-None-Match: "751e-513b1721525d0"
If-Modified-Since: Tue, 14 Apr 2015 16:08:47 GMT
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36
Referer: http://apache.org/foundation/
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,ro;q=0.6

HTTP/1.1 304 Not Modified
Date: Tue, 14 Apr 2015 22:42:51 GMT
Server: Apache/2.4.7 (Ubuntu)
Connection: Keep-Alive
Keep-Alive: timeout=30, max=100
ETag: "751e-513b1721525d0"
Expires: Tue, 14 Apr 2015 23:42:51 GMT
Cache-Control: max-age=3600
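The server's side of the conditional request above can be sketched as follows. This is a simplified model, assuming exact ETag string comparison (no weak-validator handling) and the rule that If-None-Match takes precedence over If-Modified-Since; the function name `conditional_status` is an assumption.

```python
from email.utils import parsedate_to_datetime


def conditional_status(request_headers: dict, etag: str, last_modified: str) -> int:
    """Return 304 if the client's cached copy is still valid, otherwise 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match may list several entity tags separated by commas.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return 304 if (etag in candidates or "*" in candidates) else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        modified = parsedate_to_datetime(last_modified)
        # Unchanged since the client's copy was fetched: nothing to resend.
        return 304 if modified <= since else 200
    return 200


# Values taken from the asf_logo.png exchange above.
ETAG = '"751e-513b1721525d0"'
LAST_MODIFIED = "Tue, 14 Apr 2015 16:08:47 GMT"

print(conditional_status({"If-None-Match": ETAG}, ETAG, LAST_MODIFIED))
```

A 304 response carries no body, so the browser reuses the copy in its cache.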

Expires HTTP Header

.htaccess

ExpiresActive On

# HTML expires 1 hour after access
ExpiresByType text/html   A3600

# GIF, JPEG, and PNG expire 30 days after access
ExpiresByType image/gif   A2592000
ExpiresByType image/jpeg  A2592000
ExpiresByType image/png   A2592000

# types not specified expire 1 day after access
ExpiresDefault "now plus 1 day"

Or, expire based upon modification time of document:

ExpiresActive On

# HTML expires 1 day after it was last modified
ExpiresByType text/html   M86400

# types not specified also expire 1 day after modification
ExpiresDefault M86400
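The short codes used above combine a base (A for access time, M for modification time) with a number of seconds. A minimal Python sketch of how such a code turns into an Expires header value, assuming UTC timestamps; the function name `expires_header` is an assumption, and the sample times are taken from the asf_logo.png exchange earlier in this session.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(code: str, access_time: datetime, modified_time: datetime) -> str:
    """Compute an Expires value from a code such as 'A3600' or 'M86400'."""
    base_letter, seconds = code[0], int(code[1:])
    if base_letter == "A":
        base = access_time       # relative to when the client asked
    elif base_letter == "M":
        base = modified_time     # relative to when the file last changed
    else:
        raise ValueError(f"unknown base in {code!r}")
    # usegmt=True produces the 'GMT' suffix used in HTTP dates.
    return format_datetime(base + timedelta(seconds=seconds), usegmt=True)


access = datetime(2015, 4, 14, 22, 40, 52, tzinfo=timezone.utc)
modified = datetime(2015, 4, 14, 16, 8, 47, tzinfo=timezone.utc)

print(expires_header("A3600", access, modified))
print(expires_header("M86400", access, modified))
```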

Do not cache

If you do not want your page cached, set these HTTP response headers:

Cache-Control: no-cache
Pragma: no-cache
Expires: <set to now>

In .htaccess in Apache, this would translate to:

ExpiresActive On
ExpiresDefault "now"
Header set Cache-Control "no-cache"
Header set Pragma "no-cache"

Minify and Compress Content

Compress Content

mod_deflate compresses content before sending to web browser.

Simple use:

AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/xml
AddOutputFilterByType DEFLATE application/javascript
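To see why text formats are worth compressing, the effect can be tried locally with Python's `gzip` module, which uses the same gzip encoding that browsers accept. The markup below is a made-up, highly repetitive sample, so it compresses better than a typical real page would.

```python
import gzip

# Hypothetical, repetitive markup standing in for a course listing page.
html = (
    "<li><a href='/courses/csci-e-12'>Fundamentals of Website Development</a></li>\n"
    * 200
).encode("utf-8")

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)

print(f"original:   {len(html)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.1%} of original)")
```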

Does Compressing Help?


Harvard Summer School CSCI Course Listing


firebug - page weight is 172 KB

Savings with Apache DEFLATE output filter

Friendly Errors

Apache Default "Not Found" 404 document:
404

"Not Found" 404 for Whitehouse
404 Not Found for Whitehouse

"Not Found" 404 for Harvard University
404 Not Found for Harvard University

Custom Error Documents

.htaccess
ErrorDocument 401 /~jharvard/error/status401.html
ErrorDocument 403 /~jharvard/error/status403.html
ErrorDocument 404 /~jharvard/error/status404.html  

Friendly Ways to Get There

HTTP Redirect

Redirecting Requests

HTTP Status Codes:
301 Moved Permanently
302 Found (temporary redirect; called "Moved Temporarily" in HTTP/1.0)

Redirecting client requests can be very useful:

Redirect

For cscie12.dce.harvard.edu the .htaccess file contains:

Redirect 302 /syllabus    https://harvard.instructure.com/courses/1812/assignments/syllabus
Try it:

Rewrite

mod_rewrite uses regular expressions to match on a pattern and rewrite incoming URLs to a new URL location.


Using mod_rewrite from within .htaccess

If you use RewriteRule from within an .htaccess file, you will usually need the RewriteBase directive, which sets the URL prefix used for relative substitutions.
See: http://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewritebase

Example - Make Simple Links Instead of Complex Ones

Context: a Parks and Recreation department offers a class, and we want an easy way to link directly to it.

Parks and Rec system:
https://webtrac.littletonrec.com/wbwsc/webtrac.wsc/wbsearch.html

Link I can use with a rewrite rule:
http://littletontrack.org/lpr-303107

RewriteEngine On
RewriteBase /
RewriteRule ^lpr-(.*)$ https://webtrac.littletonrec.com/wbwsc/webtrac.wsc/wbsearch.html?per=10&xxsearch=yes&xxdispmap=no+&xxmulti-list=&xxmulti-lbls=&xxrowid=&xxmod=ar&xxactivitynumber=$1&xxage=&xxgrade=&xxkeyword=&xxkeywordoption=N&xxtype=&xxcategory=&xxsortoption=ActivityNumber&xxdisplayoption=D&xxsubmit=Search
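The pattern-and-substitution idea in the rule above can be tried out with Python's `re` module. In this sketch the target URL is deliberately shortened to two query parameters, so `TARGET` is a stand-in rather than the full query string, and the function name `rewrite` is an assumption. Python's `{0}` plays the role of `$1` in the RewriteRule.

```python
import re

# Same pattern as the RewriteRule above.
PATTERN = re.compile(r"^lpr-(.*)$")

# Shortened stand-in for the long query string in the real rule.
TARGET = (
    "https://webtrac.littletonrec.com/wbwsc/webtrac.wsc/wbsearch.html"
    "?xxactivitynumber={0}&xxsubmit=Search"
)


def rewrite(path: str):
    """Return the rewritten URL for a request path, or None if the rule does not match."""
    # In .htaccess context the leading slash is removed before matching.
    match = PATTERN.match(path.lstrip("/"))
    if match is None:
        return None
    return TARGET.format(match.group(1))


print(rewrite("/lpr-303107"))
```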

Example: Create Links that can always point to the correct place

Road Race Registration is done through a 3rd party service, SignMeUp

Redirect  /registration https://www.signmeup.com/site/reg/register.aspx?fid=B42VRH7

Redirect /map http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=101999702593116464805.00046f1a27a9feb5aacaf&ll=42.52946,-71.485934&spn=0.018975,0.018239&z=15
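How a prefix-based Redirect behaves can be sketched in Python: a request path that equals the prefix, or continues below it, gets a Location header pointing at the target, with any extra path appended. The rule table reuses the /syllabus example from earlier in this session; the function name `redirect` and the 302 default for the table entries are assumptions of this simplified model.

```python
# prefix -> (status code, target URL); 302 is the temporary redirect status.
RULES = {
    "/syllabus": (302, "https://harvard.instructure.com/courses/1812/assignments/syllabus"),
}


def redirect(path: str, rules: dict):
    """Return (status, location) for the first matching prefix, or None."""
    for prefix, (status, target) in rules.items():
        # Match the prefix itself or anything below it, but not "/syllabus2".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return status, target + path[len(prefix):]
    return None


print(redirect("/syllabus", RULES))
```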

URL Shortener Services

Search Engines and Optimization


Content: meta tags

meta tags and Metadata Guidelines (W3 EOWG)

meta elements from Harvard University:

<meta property="twitter:account_id" content="1491443782" />
<meta name="twitter:site" content="@harvard">
<meta name="twitter:creator" content="@harvard">
<meta name="twitter:card" content="summary_large_image">

<meta property="og:site_name" content="Harvard University"/>
<meta property="og:type" content="university"/>
<meta property="og:title" content="Harvard University"/>
<meta property="og:url" content="https://www.harvard.edu"/>
<meta property="og:description" content="Harvard University is devoted to excellence in teaching, learning, and research, and to developing leaders in many disciplines who make a difference globally. Harvard University is made up of 11 principal academic units."/>
<meta property="og:image" content="https://www.harvard.edu/sites/default/files/default_images/harvard-social1200.jpg"/>

Search Robots, Crawlers, Spiders

Three mechanisms to instruct robots that visit your site:

  1. robots.txt file
  2. robots meta tag
  3. rel="nofollow" for a elements

robots.txt and Examples

Two directives:
Note: robots.txt must be at the root level of the server.
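The directives most commonly seen in robots.txt files are User-agent (which robots a group of rules applies to) and Disallow (which URL paths those robots should not fetch). Python's standard `urllib.robotparser` module can interpret such rules, which makes it a convenient way to experiment. The rules, robot name, and example.org URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all robots are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("ExampleBot", "https://example.org/private/notes.html")
allowed = parser.can_fetch("ExampleBot", "https://example.org/index.html")
print(blocked, allowed)
```

Keep in mind that robots.txt is advisory: well-behaved crawlers honor it, but it does not prevent access.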

Check out some real robots.txt files!

Robots meta element in markup

<meta name="robots" content="noindex,nofollow" />

The Robots meta element can be used on a per document basis.

HTTP Header: X-Robots-Tag

Accessibility

The power of the Web is in its universality.
Access by everyone regardless of disability is an essential aspect.

Tim Berners-Lee, W3C Director and inventor of the World Wide Web

Things to know about:

Accessibility: Getting Started


Oh, The Places You'll Go!


So, in general, I’d encourage you to think about three potential areas, and decide based on your interests which one you pursue next: