Sunday, April 23, 2006

Types of Join

Types of join

Occasionally there is a post on a forum asking what a certain type of join is all about, so I thought it would probably be good to have a stock explanation to refer people to so that I don't re-write near enough the same response each time the question arises.

First lets consider these two tables.


Key         Data
----------- ----------
1 a
2 b


Key         Data
----------- ----------
1 c
3 d
We can see that the only match is where Key is 1.


In an INNER JOIN that will be the only thing returned. If we use the query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
INNER JOIN B ON a.[Key] = b.[Key]
the returned set will be
aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1 a 1 c

In the case of the various outer joins non-matches will be returned also.


In a LEFT OUTER JOIN everything on the left side will be returned. Any matches on the right side will be returned also, but if there is no match on the right side then nulls are returned instead.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
LEFT OUTER JOIN B ON a.[Key] = b.[Key]
aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1 a 1 c


The RIGHT OUTER JOIN is very similar to the LEFT OUTER JOIN, except that, of course, the matching is reversed. Everything on the right side is returned, and only matches on the left side are returned. Any non-matches will be filled with nulls on the left side.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
RIGHT OUTER JOIN B ON a.[Key] = b.[Key]
aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1 a 1 c


A FULL OUTER JOIN returns a set containing all rows from either side, matched if possible, but nulls put in place if not.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
FULL OUTER JOIN B ON a.[Key] = b.[Key]
aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1 a 1 c


The CROSS JOIN doesn't obey the same set of rules as the other joins. This is because it doesn't care about matching rows from either side, so there is no ON qualifier within the join clause. This is a simple join that joins all rows on the left side to all rows on the right side. Where the other joins cannot return more rows than exist in the most populous of the source tables, the CROSS JOIN will return the product of rows from each side. If you have 5 rows in Table A, and 6 rows in Table B it will return a set containing 30 rows.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1 a 1 c
2 b 1 c
1 a 3 d
2 b 3 d


Friday, April 21, 2006

Thursday, April 20, 2006

SED oneliners

Handy one-liners for SED

HANDY ONE-LINERS FOR SED (Unix stream editor)               Mar. 23, 2001
compiled by Eric Pement version 5.1
Latest version of this file is usually at:
This file is also available in Portuguese at:


# double space a file
sed G

# double space a file which already has blank lines in it. Output file
# should contain no more than one blank line between lines of text.
sed '/^$/d;G'

# triple space a file
sed 'G;G'

# undo double-spacing (assumes even-numbered lines are always blank)
sed 'n;d'


# number each line of a file (simple left alignment). Using a tab (see
# note on '\t' at end of file) instead of space will preserve margins.
sed = filename | sed 'N;s/\n/\t/'

# number each line of a file (number on left, right-aligned)
sed = filename | sed 'N; s/^/ /; s/ *\(.\{6,\}\)\n/\1 /'

# number each line of file, but only print numbers if line is not blank
sed '/./=' filename | sed '/./N; s/\n/ /'

# count lines (emulates "wc -l")
sed -n '$='


# IN UNIX ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
sed 's/.$//' # assumes that all lines end with CR/LF
sed 's/^M$//' # in bash/tcsh, press Ctrl-V then Ctrl-M
sed 's/\x0D$//' # gsed 3.02.80, but top script is easier

# IN UNIX ENVIRONMENT: convert Unix newlines (LF) to DOS format
sed "s/$/`echo -e \\\r`/" # command line under ksh
sed 's/$'"/`echo \\\r`/" # command line under bash
sed "s/$/`echo \\\r`/" # command line under zsh
sed 's/$/\r/' # gsed 3.02.80

# IN DOS ENVIRONMENT: convert Unix newlines (LF) to DOS format
sed "s/$//" # method 1
sed -n p # method 2

# IN DOS ENVIRONMENT: convert DOS newlines (CR/LF) to Unix format
# Cannot be done with DOS versions of sed. Use "tr" instead.
tr -d \r outfile # GNU tr version 1.22 or higher

# delete leading whitespace (spaces, tabs) from front of each line
# aligns all text flush left
sed 's/^[ \t]*//' # see note on '\t' at end of file

# delete trailing whitespace (spaces, tabs) from end of each line
sed 's/[ \t]*$//' # see note on '\t' at end of file

# delete BOTH leading and trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//'

# insert 5 blank spaces at beginning of each line (make page offset)
sed 's/^/ /'

# align all text flush right on a 79-column width
sed -e :a -e 's/^.\{1,78\}$/ &/;ta' # set at 78 plus 1 space

# center all text in the middle of 79-column width. In method 1,
# spaces at the beginning of the line are significant, and trailing
# spaces are appended at the end of the line. In method 2, spaces at
# the beginning of the line are discarded in centering the line, and
# no trailing spaces appear at the end of lines.
sed -e :a -e 's/^.\{1,77\}$/ & /;ta' # method 1
sed -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/' # method 2

# substitute (find and replace) "foo" with "bar" on each line
sed 's/foo/bar/' # replaces only 1st instance in a line
sed 's/foo/bar/4' # replaces only 4th instance in a line
sed 's/foo/bar/g' # replaces ALL instances in a line
sed 's/\(.*\)foo\(.*foo\)/\1bar\2/' # replace the next-to-last case
sed 's/\(.*\)foo/\1bar/' # replace only the last case

# substitute "foo" with "bar" ONLY for lines which contain "baz"
sed '/baz/s/foo/bar/g'

# substitute "foo" with "bar" EXCEPT for lines which contain "baz"
sed '/baz/!s/foo/bar/g'

# change "scarlet" or "ruby" or "puce" to "red"
sed 's/scarlet/red/g;s/ruby/red/g;s/puce/red/g' # most seds
gsed 's/scarlet\|ruby\|puce/red/g' # GNU sed only

# reverse order of lines (emulates "tac")
# bug/feature in HHsed v1.5 causes blank lines to be deleted
sed '1!G;h;$!d' # method 1
sed -n '1!G;h;$p' # method 2

# reverse each character on the line (emulates "rev")
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'

# join pairs of lines side-by-side (like "paste")
sed '$!N;s/\n/ /'

# if a line ends with a backslash, append the next line to it
sed -e :a -e '/\\$/N; s/\\\n//; ta'

# if a line begins with an equal sign, append it to the previous line
# and replace the "=" with a single space
sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D'

# add commas to numeric strings, changing "1234567" to "1,234,567"
gsed ':a;s/\B[0-9]\{3\}\>/,&/;ta' # GNU sed
sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta' # other seds

# add commas to numbers with decimal points and minus signs (GNU sed)
gsed ':a;s/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g;ta'

# add a blank line every 5 lines (after lines 5, 10, 15, 20, etc.)
gsed '0~5G' # GNU sed only
sed 'n;n;n;n;G;' # other seds


# print first 10 lines of file (emulates behavior of "head")
sed 10q

# print first line of file (emulates "head -1")
sed q

# print the last 10 lines of a file (emulates "tail")
sed -e :a -e '$q;N;11,$D;ba'

# print the last 2 lines of a file (emulates "tail -2")
sed '$!N;$!D'

# print the last line of a file (emulates "tail -1")
sed '$!d' # method 1
sed -n '$p' # method 2

# print only lines which match regular expression (emulates "grep")
sed -n '/regexp/p' # method 1
sed '/regexp/!d' # method 2

# print only lines which do NOT match regexp (emulates "grep -v")
sed -n '/regexp/!p' # method 1, corresponds to above
sed '/regexp/d' # method 2, simpler syntax

# print the line immediately before a regexp, but not the line
# containing the regexp
sed -n '/regexp/{g;1!p;};h'

# print the line immediately after a regexp, but not the line
# containing the regexp
sed -n '/regexp/{n;p;}'

# print 1 line of context before and after regexp, with line number
# indicating where the regexp occurred (similar to "grep -A1 -B1")
sed -n -e '/regexp/{=;x;1!p;g;$!N;p;D;}' -e h

# grep for AAA and BBB and CCC (in any order)
sed '/AAA/!d; /BBB/!d; /CCC/!d'

# grep for AAA and BBB and CCC (in that order)
sed '/AAA.*BBB.*CCC/!d'

# grep for AAA or BBB or CCC (emulates "egrep")
sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d # most seds
gsed '/AAA\|BBB\|CCC/!d' # GNU sed only

# print paragraph if it contains AAA (blank lines separate paragraphs)
# HHsed v1.5 must insert a 'G;' after 'x;' in the next 3 scripts below
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;'

# print paragraph if it contains AAA and BBB and CCC (in any order)
sed -e '/./{H;$!d;}' -e 'x;/AAA/!d;/BBB/!d;/CCC/!d'

# print paragraph if it contains AAA or BBB or CCC
sed -e '/./{H;$!d;}' -e 'x;/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d
gsed '/./{H;$!d;};x;/AAA\|BBB\|CCC/b;d' # GNU sed only

# print only lines of 65 characters or longer
sed -n '/^.\{65\}/p'

# print only lines of less than 65 characters
sed -n '/^.\{65\}/!p' # method 1, corresponds to above
sed '/^.\{65\}/d' # method 2, simpler syntax

# print section of file from regular expression to end of file
sed -n '/regexp/,$p'

# print section of file based on line numbers (lines 8-12, inclusive)
sed -n '8,12p' # method 1
sed '8,12!d' # method 2

# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files

# beginning at line 3, print every 7th line
gsed -n '3~7p' # GNU sed only
sed -n '3,${p;n;n;n;n;n;n;}' # other seds

# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive


# print all of file EXCEPT section between 2 regular expressions
sed '/Iowa/,/Montana/d'

# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'

# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'

# delete the first 10 lines of a file
sed '1,10d'

# delete the last line of a file
sed '$d'

# delete the last 2 lines of a file
sed 'N;$!P;$!D;$d'

# delete the last 10 lines of a file
sed -e :a -e '$d;N;2,10ba' -e 'P;D' # method 1
sed -n -e :a -e '1,10!{P;N;D;};N;ba' # method 2

# delete every 8th line
gsed '0~8d' # GNU sed only
sed 'n;n;n;n;n;n;n;d;' # other seds

# delete ALL blank lines from a file (same as "grep '.' ")
sed '/^$/d' # method 1
sed '/./!d' # method 2

# delete all CONSECUTIVE blank lines from file except the first; also
# deletes all blank lines from top and end of file (emulates "cat -s")
sed '/./,/^$/!d' # method 1, allows 0 blanks at top, 1 at EOF
sed '/^$/N;/\n$/D' # method 2, allows 1 blank at top, 0 at EOF

# delete all CONSECUTIVE blank lines from file except the first 2:
sed '/^$/N;/\n$/N;//D'

# delete all leading blank lines at top of file
sed '/./,$!d'

# delete all trailing blank lines at end of file
sed -e :a -e '/^\n*$/{$d;N;ba' -e '}' # works on all seds
sed -e :a -e '/^\n*$/N;/\n$/ba' # ditto, except for gsed 3.02*

# delete the last line of each paragraph
sed -n '/^$/{p;h;};/./{x;/./p;}'


# remove nroff overstrikes (char, backspace) from man pages. The 'echo'
# command may need an -e switch if you use Unix System V or bash shell.
sed "s/.`echo \\\b`//g" # double quotes required for Unix environment
sed 's/.^H//g' # in bash/tcsh, press Ctrl-V and then Ctrl-H
sed 's/.\x08//g' # hex expression for sed v1.5

# get Usenet/e-mail message header
sed '/^$/q' # deletes everything after first blank line

# get Usenet/e-mail message body
sed '1,/^$/d' # deletes everything up to first blank line

# get Subject header, but remove initial "Subject: " portion
sed '/^Subject: */!d; s///;q'

# get return address header
sed '/^Reply-To:/q; /^From:/h; /./d;g;q'

# parse out the address proper. Pulls out the e-mail address by itself
# from the 1-line return address header (see preceding script)
sed 's/ *(.*)//; s/>.*//; s/.*[:<] *//'

# add a leading angle bracket and space to each line (quote a message)
sed 's/^/> /'

# delete leading angle bracket & space from each line (unquote a message)
sed 's/^> //'

# remove most HTML tags (accommodates multiple-line tags)
sed -e :a -e 's/<[^>]*>//g;/
# extract multi-part uuencoded binaries, removing extraneous header
# info, so that only the uuencoded portion remains. Files passed to
# sed must be passed in the proper order. Version 1 can be entered
# from the command line; version 2 can be made into an executable
# Unix shell script. (Modified from a script by Rahul Dhesi.)
sed '/^end/,/^begin/d' file1 file2 ... fileX | uudecode # vers. 1
sed '/^end/,/^begin/d' "$@" | uudecode # vers. 2

# zip up each .TXT file individually, deleting the source file and
# setting the name of each .ZIP file to the basename of the .TXT file
# (under DOS: the "dir /b" switch returns bare filenames in all caps).
echo @echo off >zipup.bat
dir /b *.txt | sed "s/^\(.*\)\.TXT/pkzip -mo \1 \1.TXT/" >>zipup.bat

TYPICAL USE: Sed takes one or more editing commands and applies all of
them, in sequence, to each line of input. After all the commands have
been applied to the first input line, that line is output and a second
input line is taken for processing, and the cycle repeats. The
preceding examples assume that input comes from the standard input
device (i.e, the console, normally this will be piped input). One or
more filenames can be appended to the command line if the input does
not come from stdin. Output is sent to stdout (the screen). Thus:

cat filename | sed '10q' # uses piped input
sed '10q' filename # same effect, avoids a useless "cat"
sed '10q' filename > newfile # redirects output to disk

For additional syntax instructions, including the way to apply editing
commands from a disk file instead of the command line, consult "sed &
awk, 2nd Edition," by Dale Dougherty and Arnold Robbins (O'Reilly,
1997;, "UNIX Text Processing," by Dale Dougherty
and Tim O'Reilly (Hayden Books, 1987) or the tutorials by Mike Arst
distributed in U-SEDIT2.ZIP (many sites). To fully exploit the power
of sed, one must understand "regular expressions." For this, see
"Mastering Regular Expressions" by Jeffrey Friedl (O'Reilly, 1997).
The manual ("man") pages on Unix systems may be helpful (try "man
sed", "man regexp", or the subsection on regular expressions in "man
ed"), but man pages are notoriously difficult. They are not written to
teach sed use or regexps to first-time users, but as a reference text
for those already acquainted with these tools.

QUOTING SYNTAX: The preceding examples use single quotes ('...')
instead of double quotes ("...") to enclose editing commands, since
sed is typically used on a Unix platform. Single quotes prevent the
Unix shell from intrepreting the dollar sign ($) and backquotes
(`...`), which are expanded by the shell if they are enclosed in
double quotes. Users of the "csh" shell and derivatives will also need
to quote the exclamation mark (!) with the backslash (i.e., \!) to
properly run the examples listed above, even within single quotes.
Versions of sed written for DOS invariably require double quotes
("...") instead of single quotes to enclose editing commands.

USE OF '\t' IN SED SCRIPTS: For clarity in documentation, we have used
the expression '\t' to indicate a tab character (0x09) in the scripts.
However, most versions of sed do not recognize the '\t' abbreviation,
so when typing these scripts from the command line, you should press
the TAB key instead. '\t' is supported as a regular expression
metacharacter in awk, perl, and HHsed, sedmod, and GNU sed v3.02.80.

VERSIONS OF SED: Versions of sed do differ, and some slight syntax
variation is to be expected. In particular, most do not support the
use of labels (:name) or branch instructions (b,t) within editing
commands, except at the end of those commands. We have used the syntax
which will be portable to most users of sed, even though the popular
GNU versions of sed allow a more succinct syntax. When the reader sees
a fairly long command such as this:

sed -e '/AAA/b' -e '/BBB/b' -e '/CCC/b' -e d

it is heartening to know that GNU sed will let you reduce it to:

sed '/AAA/b;/BBB/b;/CCC/b;d' # or even
sed '/AAA\|BBB\|CCC/b;d'

In addition, remember that while many versions of sed accept a command
like "/one/ s/RE1/RE2/", some do NOT allow "/one/! s/RE1/RE2/", which
contains space before the 's'. Omit the space when typing the command.

OPTIMIZING FOR SPEED: If execution speed needs to be increased (due to
large input files or slow processors or hard disks), substitution will
be executed more quickly if the "find" expression is specified before
giving the "s/.../.../" instruction. Thus:

sed 's/foo/bar/g' filename # standard replace command
sed '/foo/ s/foo/bar/g' filename # executes more quickly
sed '/foo/ s//bar/g' filename # shorthand sed syntax

On line selection or deletion in which you only need to output lines
from the first part of the file, a "quit" command (q) in the script
will drastically reduce processing time for large files. Thus:

sed -n '45,50p' filename # print line nos. 45-50 of a file
sed -n '51q;45,50p' filename # same, but executes much faster

If you have any additional scripts to contribute or if you find errors
in this document, please send e-mail to the compiler. Indicate the
version of sed you used, the operating system it was compiled for, and
the nature of the problem. Various scripts in this file were written
or contributed by:

Al Aab # "seders" list moderator
Edgar Allen # various
Yiorgos Adamopoulos
Dale Dougherty # author of "sed & awk"
Carlos Duarte # author of "do it with sed"
Eric Pement # author of this document
Ken Pizzini # author of GNU sed v3.02
S.G. Ravenhall # great de-html script
Greg Ubben # many contributions & much help

Tuesday, April 18, 2006

Elevator Design Problem

UML By Examples



    Table of Contents

    0. Introduction

    The aim of this tutorial is to show how to use UML in "real" software development environment.
    1. Elevator Problem

    A product is to be installed to control elevators in a building with m floors. The problem concerns the logic required to move elevators between floors according to the following constraints:

    • Each elevator has a set of m buttons, one for each floor. These illuminate when pressed and cause the elevator to visit the corresponding floor. The illumination is canceled when the elevator visits the corresponding floor.
    • Each floor, except the first floor and top floor has two buttons, one to request and up-elevator and one to request a down-elevator. These buttons illuminate when pressed. The illumination is canceled when an elevator visits the floor and then moves in the desired direction.
    • When an elevator has no requests, it remains at its current floor with its doors closed.

    2. Unified Modeling Language

    UML is a modeling language that only specifies semantics and notation but no process is currently defined. Thus, we decided to do the analysis as follows;

    • Use Case Diagram
    • Class Diagram
    • Sequence Diagram
    • Collabration Diagram
    • State Diagram
    3. Analysis

    3.1. Use case diagram

    Use case description:

    • A generalized description of how a system will be used.
    • Provides an overview of the intended functionality of the system.
    • Understandable by laymen as well as professionals.
    Use Case Diagram:

    Elevator basic scenario that can be extracted from Use Case Diagram:

    • Passenger pressed floor button
    • Elevator system detects floor button pressed
    • Elevator moves to the floor
    • Elevator doors open
    • Passenger gets in and presses elevator button
    • Elevator doors closes
    • Elevator moves to required floor
    • Elevator doors open
    • Passenger gets out
    • Elevator doors closes

    3.2. Class Diagram

    Class diagrams show the static structure of the object, their internal structure, and their relationships.

    Class diagram:

    3.3. State diagram

    A state diagram shows the sequences of states an object goes through during it's life cycle in response to stimuli, together with its responses and actions.

    4. Design

    The design phase should produce the detailed class diagrams, collaboration diagrams, sequence diagrams, state diagrams, and activity diagram. However, the elevator problem is too simple for an activity diagram. Thus, we are not using an activity diagram for the elevator problem.

    4.1. Sequence Diagram

    A sequence diagram and collaboration diagram conveys similar information but expressed in different ways. A Sequence diagram shows the explicit sequence of messages suitable for modeling a real-time system, whereas a collobration diagram shows the relationships between objects.

    Sequence Diagrams:

    Sequence Diagram for Serving Elevator Button

    Sequence Diagram for Serving Door Button

    4.2. Collaboration diagram
    • Describes the set of interactions between classes or types
    • Shows the relationships among objects
    Collabration diagrams:
    Collabration Digaram for Serving Elevator Button
    Collabration Digaram for Serving Door Button

    5. Detail Design

    5.1. Detail Class Diagram

    5.2. Detail Operation Description

      Module Name Elevator_Control::Elevator_control_loop
      Module Type Method
      Input Argument None
      Output Argument None
      Error Message None
      File Access None
      File Change None
      Method Invoke button::illuminate, button::cancel_illumination,
      door::open, door::close, elevator::move, elevator::stop

    5.3. Pseudo-Code
      void elevator_control (void)
      while a button has been pressed
      if button not on
      update request list;
      else if elevator is moving up
      if there is no request to stop at floor f
      Elevator::move one floor up;


      6. Acknowledgement

      This example was developed for topic in software engineering in Vanderbilt University by myself and my best friends:

Monday, April 17, 2006

What is Windows Collation

Windows collations define rules for storing character data based on an associated Windows locale.

Sunday, April 16, 2006

Semaphore and Mutex

What's the difference between semaphore and mutex?

A: Rules of thumb:

- By semaphore we actually mean counting semaphore. By mutex we actually mean mutex semaphore. A mutex is essentially a binary semaphore. You can replace any Mutex with a Semaphore of MaxCount 1.

- Mutexes are usually more efficient than binary semaphores.

- Mutex has an owner concept: unlocking a mutex can be only done by the thread that locked (or, equivalently, "owns") the mutex.

- However, though I don't think it's a real issue in this particular case either, mutex can do fancy things with the thread that has locked the mutex (mutex owner thread), like raising its priority to the highest one of the threads waiting for the mutex to prevent so called priority inversion. (huican ping notion, not in the original version)

Saturday, April 15, 2006


Thursday, April 13, 2006

Algorithms in Latex

Preditive Accuracy

Several possible criteria for evaluating a learning algorithm:
  • Predictive accuracy of classifier
  • Speed of learner
  • Speed of classifier
  • Space requirements

Most common criterion is predictive accuracy

Wednesday, April 12, 2006



Maximum Likelihood Estimation (MLE)


Now we are in a position to introduce the concept of likelihood. If the probability of an event X dependent on model parameters p is written
   P ( X | p )

then we would talk about the likelihood
   L ( p | X )

that is, the likelihood of the parameters given the data. For most sensible models, we will find that certain data are more probable than other data. The aim of maximum likelihood estimation is to find the parameter value(s) that makes the observed data most likely. This is because the likelihood of the parameters given the data is defined to be equal to the probability of the data given the parameters (nb. technically, they are proportional to each other, but this does not affect the principle). If we were in the business of making predictions based on a set of solid assumptions, then we would be interested in probabilities - the probability of certain outcomes occurring or not occurring. However, in the case of data analysis, we have already observed all the data: once they have been observed they are fixed, there is no 'probabilistic' part to them anymore (the word data comes from the Latin word meaning 'given'). We are much more interested in the likelihood of the model parameters that underly the fixed data.
Knowing parameters -> Prediction of outcome

Observation of data -> Estimation of parameters

A simple example of MLE

To re-iterate, the simple principle of maximum likelihood parameter estimation is this: find the parameter values that make the observed data most likely. How would we go about this in a simple coin toss experiment? That is, rather than assume that p is a certain value (0.5) we might wish to find the maximum likelihood estimate (MLE) of p, given a specific dataset. Beyond parameter estimation, the likelihood framework allows us to make tests of parameter values. For example, we might want to ask whether or not the estimated p differs significantly from 0.5 or not. This test is essentially asking: is there evidence that the coin is biased? We will see how such tests can be performed when we introduce the concept of a likelihood ratio test below. Say we toss a coin 100 times and observe 56 heads and 44 tails. Instead of assuming that p is 0.5, we want to find the MLE for p. Then we want to ask whether or not this value differs significantly from 0.50. How do we do this? We find the value for p that makes the observed data most likely. As mentioned, the observed data are now fixed. They will be constants that are plugged into our binomial probability model :-
  • n = 100 (total number of tosses)
  • h = 56 (total number of heads)
Imagine that p was 0.5. Plugging this value into our probability model as follows :-
But what if p was 0.52 instead?
So from this we can conclude that p is more likely to be 0.52 than 0.5. We can tabulate the likelihood for different parameter values to find the maximum likelihood estimate of p:
                  p       L
0.48 0.0222
0.50 0.0389
0.52 0.0581
0.54 0.0739
0.56 0.0801
0.58 0.0738
0.60 0.0576
0.62 0.0378
If we graph these data across the full range of possible values for p we see the following likelihood surface.
We see that the maximum likelihood estimate for p seems to be around 0.56. In fact, it is exactly 0.56, and it is easy to see why this makes sense in this trivial example. The best estimate for p from any one sample is clearly going to be the proportion of heads observed in that sample. (In a similar way, the best estimate for the population mean will always be the sample mean.) So why did we waste our time with the maximum likelihood method? In such a simple case as this, nobody would use maximum likelihood estimation to evaluate p. But not all problems are this simple! As we shall see, the more complex the model and the greater the number of parameters, it often becomes very difficult to make even reasonable guesses at the MLEs. The likelihood framework conceptually takes all of this in its stride, however, and this is what makes it the work-horse of many modern statistical methods.

Return to front page

Site created by S.Purcell, last updated 21.09.2000

Monday, April 10, 2006

Normal Forms Again II

  • The First Normal Form (1NF) addresses the structure of an isolated table.
  • The Second (2NF), Third (3NF), and Boyce-Codd (BCNF) Normal Forms address one-to-one and one-to-many relationships.
  • The Fourth (4NF) and Fifth (5NF) Normal Forms deal with many-to-many relationships.

Good implementations in C

Thursday, April 06, 2006

Expectation Maximization Theory

Monday, April 03, 2006

Database Design Relationships

  • 1-1:

    • In order to create the relationship between two relations with a 1-1 relationship - post the primary key of one table into the other table, it doesn't matter which way around.

    • The key placed into the other relation is called a foreign key.

    • MP{name, dob, party, constituencyName*}

    • Constituency{name, size}

  • 1-Many:

    • To create the relationship - the primary key of the 1 side is posted to the many side as a foreign key.

    • LibraryBook{id_no, name}

    • Reader{membership_no, name, book_id*}

    • Imagine if the many side was posted to the one side instead.

  • Many-Many (M-N):

    • To create the relationship we can't add a foreign key, because will result in repeted data.

    • Instead, create an extra link table between the two relations, which has as its primary key the foreign keys from the two relations.

    • Actor{id_no, name}

    • Film{id_no, title}

    • Cast{actor_no*, film_no*}

    • The foreign keys from the original tables form a compound primary key in the link table.

    • The link table has a 1-many relationship with both the original tables.

    • This method prevents duplications of the relationship - actors could not be entered for the same film twice because the foreign keys are the primary key.

    • A link table can have its own attributes, if it is appropriate for the relationship to have attributes.

    • E.g. the Cast table could have an attribute 'time_spent_on_set'