Apache LogFile parser, need help


harkerboy
February 15th, 2006, 07:58 AM
I'm new to Ruby, or strictly speaking, new to programming. I am currently working on a project that requires writing a parser in Ruby for the log file of the Apache web server, in order to produce some page-visit statistics, e.g. the most popular page (or the top 10 pages) on a specific day or in a specific week.

As a beginner, I have read a few books and related docs again and again, but I still have no idea how to write the parser class. I'm posting the problem here in the hope that you experts can give me a hand.

The daily log file contains entries, each of which follows the standard format shown below. A hyphen ("-") indicates that the information is not available.

clientIP identd userid time request statusCode objSize
for example,
124.61.45.136 - - [06/Oct/2005:17:03:08 +0100] "GET /interface/video-ipod.html HTTP/1.1" 200 5657
128.61.45.136 - - [06/Oct/2005:17:03:10 +0100] "GET /php/adlog.htm" 200 43

Each piece of information is separated by a space, and each entry is on its own line; the whole file consists of entries in this format.

To parse this file, I have some basic ideas:

1. To read through the *.log file, I use the code
source = File.new("12-01-2005.log", "r")
while (line = source.gets)

2. For each line (that is, each entry), some parsing expressions:
for clientIP, e.g. 124.61.45.136, it can be expressed by /[0-9]+(.[0-9]+)?/
for identd, it is always a hyphen "-", so it can be expressed by /-/
for userid, it is arbitrarily many characters, so /[a-zA-Z0-9]+/
for time, e.g. [06/Oct/2005:17:03:08 +0100], as it starts with "[" and ends with "]", it can be expressed by /^[$]/
for the request, e.g. "GET /interface/video-ipod.html HTTP/1.1", it can be /^"$"/
the other two are simply digits, /[0-9]+/

3. The result of the parser class could be an array for further use; that is, each parsed entry is stored as an "entry" object in an array.
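
For reference, the three steps above might be sketched roughly as follows. This is only a minimal sketch, assuming every entry has exactly the seven fields listed earlier; the names Entry, LINE_PATTERN and parse_line are made up for illustration. The patterns differ from the ones listed in step 2 because an unescaped "." matches any character, "[" has a special meaning in a regular expression and must be escaped to match literally, and ^ and $ anchor to the start and end of a line rather than marking delimiters.

```ruby
# Hypothetical names: Entry, LINE_PATTERN and parse_line are illustrative only.
Entry = Struct.new(:client_ip, :identd, :userid, :time, :request, :status, :bytes)

# One pattern for the whole line. The time is delimited by [ ] and the
# request by double quotes, so both of them may contain spaces.
LINE_PATTERN = /\A(\d{1,3}(?:\.\d{1,3}){3}) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)\s*\z/

# Returns an Entry, or nil when the line does not match the pattern.
def parse_line(line)
  m = LINE_PATTERN.match(line)
  return nil unless m
  ip, identd, userid, time, request, status, bytes = m.captures
  # A "-" object size is stored as 0.
  Entry.new(ip, identd, userid, time, request, status.to_i, bytes == "-" ? 0 : bytes.to_i)
end

# The two sample entries from the post above.
sample = [
  '124.61.45.136 - - [06/Oct/2005:17:03:08 +0100] "GET /interface/video-ipod.html HTTP/1.1" 200 5657',
  '128.61.45.136 - - [06/Oct/2005:17:03:10 +0100] "GET /php/adlog.htm" 200 43'
]

# Step 3: collect the parsed entries in an array, skipping lines that fail.
entries = sample.map { |l| parse_line(l) }.compact
entries.each { |e| puts "#{e.client_ip} #{e.status} #{e.request}" }
```

To read a real file instead of the sample array, the same map could presumably be run over File.readlines("12-01-2005.log").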

This is the first step I need to do. I have learnt a little, and the above is what I have designed so far. I expect that some of it is incorrect and some of it needs to be improved. Also, as I have no experience in Ruby, I cannot put all of this together into a class. I am hoping you experts can help me with a solution. Every little helps! Thanks very much!

steve_d555
February 15th, 2006, 03:03 PM
How 'bout something like:
source = File.new("12-01-2005.log", "r")
lines = []
source.each_line do |line|
  lines << line.split(" ")
end


That is very simple and should return arrays of all those variables. You can then validate them by simply trying regexps and checking true/false.
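
A rough sketch of that validation step, assuming the entry format from the first post. The helper valid_fields? and the pattern constants are hypothetical names. One thing to watch for: because the time and the request contain spaces themselves, split(" ") spreads each of them over several tokens, so the status code and size are easiest to reach by counting from the end of the array.

```ruby
# Hypothetical helper that checks the tokens produced by line.split(" ").
IP_PATTERN     = /\A\d{1,3}(\.\d{1,3}){3}\z/
STATUS_PATTERN = /\A\d{3}\z/
SIZE_PATTERN   = /\A(\d+|-)\z/

def valid_fields?(fields)
  # Even the shortest entry in the examples splits into 9 tokens.
  return false if fields.length < 9
  IP_PATTERN.match?(fields[0]) &&
    STATUS_PATTERN.match?(fields[-2]) &&
    SIZE_PATTERN.match?(fields[-1])
end

line = '124.61.45.136 - - [06/Oct/2005:17:03:08 +0100] "GET /interface/video-ipod.html HTTP/1.1" 200 5657'
fields = line.split(" ")

puts fields.length           # more than 7: time and request span several tokens
puts valid_fields?(fields)
puts fields[6]               # the requested path sits at index 6 for these entries
```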

harkerboy
February 28th, 2006, 01:25 PM
Thanks for your idea. I've got a working program now, but it is still not perfect. There is no database (MySQL) connection, which I really need. What I mean is, I need to write the parsed requests to the database and generate reports from it. The database would mainly contain things like the time and the requested URL, which should be extracted by the parser.

Here is the code below. Does anyone have any ideas?


#!/usr/local/bin/ruby
# Use it like this:
#   ruby ruby_log_parser.rb access.log
require 'date'

class LogEntry
  attr_reader :host, :user, :auth, :date, :referrer, :ua, :rcode, :nbytes, :url

  # Matches one entry in the "combined" layout: the seven basic fields
  # followed by the quoted referrer and user-agent strings.
  @@epat = Regexp.new('^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\d{3}) (\d+|-) "(.*?)" "(.*?)"$')
  # Splits the request into method, URL and protocol.
  @@rpat = Regexp.new('\A(\S+)\s+(\S+)\s+(\S+)\Z')

  # Raises an error if the line does not match; the caller rescues it.
  def initialize(line)
    @host, @user, @auth, ds, request, code, bs, @referrer, @ua = @@epat.match(line).captures
    @date = DateTime.strptime(ds, "%d/%b/%Y:%H:%M:%S %z")
    @rcode = Integer(code)
    @nbytes = (bs == "-" ? 0 : Integer(bs))  # "-" means no size was logged

    @method, @url, @proto = @@rpat.match(request).captures
  end

  def to_s
    "LogEntry[host:" + host + ", date:" + date.to_s + ", referrer:" + referrer +
      ", url:" + url + ", ua:" + ua + ", user:" + user + ", auth:" + auth +
      ", rcode:" + rcode.to_s + ", nbytes:" + nbytes.to_s + "]"
  end
end

# Stop here when no input file is given, instead of failing in File.open.
abort "Usage: [ruby] ruby_log_parser.rb <inpfile>" if ARGV.length < 1
inpfile = File.open(ARGV[0])
t1 = Time.now
nlines = 0
start_date = end_date = nil
le = nil
hosts = Hash.new(0)
urls = Hash.new(0)
referrers = Hash.new(0)
uastrings = Hash.new(0)
st = Time.now
while line = inpfile.gets
  begin
    le = LogEntry.new(line)
    start_date = le.date if !start_date
    end_date = le.date  # date of the last entry parsed so far
    hosts[le.host] += 1
    urls[le.url] += 1
    referrers[le.referrer] += 1
    uastrings[le.ua] += 1
  rescue => e
    print "Log entry parse failed at line: ", (nlines + 1), ", error: ", e, "\n"
    print "LINE: ", line, "\n"
  end
  nlines += 1
  # Report progress every 4096 lines.
  if nlines % 4096 == 0
    et = Time.now
    puts "processed " + nlines.to_s + " lines ... (" + (et - st).to_s + " seconds)"
    st = et
  end
end
inpfile.close
t2 = Time.now

printf("start_date:%s, end_date:%s\n", start_date.to_s, end_date.to_s);
printf("lines:%d, hosts:%d, urls:%d, referrers:%d, uastrings:%d\n",
nlines, hosts.length, urls.length, referrers.length, uastrings.length);
print "Processing time : ", (t2 - t1).to_s, " seconds\n"


# Sort by count (descending) and display the top 20
def print_top20(label, h)
  arr = h.sort { |a, b| b[1] <=> a[1] }
  print "------------ " + label + " -------------\n"
  # first(20) copes with fewer than 20 items; ranks are printed from 1.
  arr.first(20).each_with_index do |(key, count), i|
    printf("%2d. %s (%d)\n", i + 1, key, count)
  end
  puts
end

t1 = Time.now
print_top20("Top 20 Hosts", hosts)
print_top20("Top 20 URLs", urls)
print_top20("Top 20 Referrers", referrers)
print_top20("Top 20 UA Strings", uastrings)
t2 = Time.now
print "Sort and Display time: ", (t2 - t1).to_s, " seconds\n"

steve_d555
February 28th, 2006, 02:51 PM
There is a MySQL library for Ruby here (http://www.tmtm.org/en/mysql/ruby/). The documentation is not great, but it should be fairly easy to connect and insert rows into a database.
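
As a rough illustration of the insert step, here is a sketch that only assumes the connection object offers prepare, returning a statement with execute(*values); check the library's own documentation for the exact calls. The table and column names (requests, requested_at, url) and the method store_requests are hypothetical, and the FakeDb class is just a stand-in so the sketch runs without a database server.

```ruby
# Hypothetical table and column names.
INSERT_SQL = "INSERT INTO requests (requested_at, url) VALUES (?, ?)"

# db is assumed to respond to #prepare and to return a statement that
# responds to #execute(*values). rows is an array of [time, url] pairs.
# Returns the number of rows handed to the statement.
def store_requests(db, rows)
  stmt = db.prepare(INSERT_SQL)
  rows.each { |time, url| stmt.execute(time, url) }
  rows.length
end

# Stand-in statement that records the values it was executed with.
class FakeStatement
  attr_reader :executed

  def initialize
    @executed = []
  end

  def execute(*values)
    @executed << values
  end
end

# Stand-in connection; a real connection would come from the MySQL library.
class FakeDb
  attr_reader :statement, :sql

  def prepare(sql)
    @sql = sql
    @statement = FakeStatement.new
  end
end

db = FakeDb.new
count = store_requests(db, [["2005-10-06 17:03:08", "/interface/video-ipod.html"],
                            ["2005-10-06 17:03:10", "/php/adlog.htm"]])
puts "stored #{count} rows"
```

Using placeholders rather than building the SQL string by hand keeps quotes in URLs from breaking the statement.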

rob
March 1st, 2006, 02:06 AM
If only the MySQL module had as much documentation as ActiveRecord :)

Which raises an interesting question: if your app has to do any web reporting or user interface at all, why not move it under the Rails umbrella? That gives you instant (and pleasant) database access.