Talk:Link transit: Difference between revisions

4,098 bytes added ,  9 June 2005
I don't currently have C++-compiler installed so I can't run that old code that Tim Starling supplied us with


:It looks like C to me, and it looks like Main() takes a standard httpd.log as input. I'll run this on our logs sometime when I have time. Kinda busy now. --[[User:Juxo|Juxo]] 13:02, 3 Sep 2004 (EEST)
::Juxo, did you ever run this on the logs? If so, please put the output at [[Consumerium:attention]]. The word "attention" is used for compatibility with [[attention.xml]], which should eventually include this type of information for all [[web service]]s that support it.
:::No, I haven't run the code. There is this little problem that I don't have a C++ compiler installed. I tried once to install one, but the installation said that some dependencies were unmet, or something similar. Right now I don't have the installation disks with me, so I can't install the compiler. --[[User:Juxo|Juxo]] 10:34, 9 Jun 2005 (GMT)


::It's C++. It outputs two sections separated by a double linefeed. The first is an indexed list of URLs. The second has three values on each line: index from, index to and the transit count. The idea is that you would read all this into a relational database with an index on all three columns, then perform whatever analysis you need to perform. -- [[User:Tim Starling|Tim Starling]] 07:13, 4 Sep 2004 (EEST)
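The output format described above can be read back with a short C++ routine before it is loaded into a database. What follows is only a minimal sketch, not part of Tim Starling's program: it assumes that each line of the first section is an index followed by whitespace and a URL, and that the second section holds three whitespace-separated integers per line. The names <code>TransitData</code>, <code>parseTransitOutput</code> and <code>outgoingTransits</code> are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One row of the second section: index from, index to, transit count.
struct Transit {
    std::size_t from;
    std::size_t to;
    unsigned long count;
};

// Hypothetical container for both sections of the output.
struct TransitData {
    std::map<std::size_t, std::string> urls;  // index -> URL
    std::vector<Transit> transits;
};

// Assumed layout (not confirmed by the original program):
//   "<index> <url>" lines, a blank line, then "<from> <to> <count>" lines.
TransitData parseTransitOutput(std::istream& in) {
    TransitData data;
    std::string line;
    bool inTransits = false;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) {
            // The first blank line after the URL list starts section two.
            if (!data.urls.empty()) inTransits = true;
            continue;
        }
        std::istringstream fields(line);
        if (!inTransits) {
            std::size_t index;
            std::string url;
            if (fields >> index >> url) data.urls[index] = url;
        } else {
            Transit t;
            if (fields >> t.from >> t.to >> t.count) data.transits.push_back(t);
        }
    }
    return data;
}

// Example analysis: total number of transits leaving one page.
unsigned long outgoingTransits(const TransitData& data, std::size_t from) {
    unsigned long total = 0;
    for (const Transit& t : data.transits) {
        if (t.from == from) total += t.count;
    }
    return total;
}
```

Calling <code>parseTransitOutput</code> on a file stream and then <code>outgoingTransits(data, i)</code> gives the number of recorded departures from page <code>i</code>. In a relational database the same rows would go into a three-column table indexed on all three columns, as suggested above.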
}
}
</pre>
</pre>
----------------
''from [http://en.wikipedia.org/w/wiki.phtml?title=Wikipedia:Village_pump&oldid=5582713#Downloading_log_files.3F Wikipedia Village pump discussions 28 August to 1 September 2004]''
Does anybody know where I can download a few days worth of Wikipedia hit log files? I'd like to do some analysis on them, but I can't figure out where they're kept. Thanks, --William Pietri 00:30, 28 Aug 2004 (UTC)
We don't have daily log files. We have log files for the current month, updated nightly. If you take the difference from one nightly log file to the next, you can calculate the day's hits. →Raul654 00:37, Aug 28, 2004 (UTC)
Ah, I was looking for the raw log files, with Referer information. I was curious to do some analysis on Wikipedia usage patterns, including how people come in via search engines and what they do after that point. But I gather you're saying that raw hit logs just aren't collected on a regular basis? In that case, are there older ones around, perhaps collected during an optimization run? --William Pietri 00:44, 28 Aug 2004 (UTC)
The raw logs are available only to Wikimedia's developers. Publishing them would be a major issue, as in general we view people's IPs, the sites and pages they visit, and what they do here to be private information. A lot of people would be very pissed off if the raw logs were available to the casual visitor. I suppose if there were such a thing as a decent unified logfile sanitiser (which removed/abstracted IP information) then publishing that would be an option. But if I were running such a sanitiser, I'd surely have it purge the referrer information too, as (occasionally) there's juicy info in the referring URL (particularly when the referrer is an email in a webmail service, and sometimes a search query). That said, there doesn't seem to be such a program anyway (I guess I don't know what it would be called, so it's rather hard to google for it). The problem of examining the log files is compounded by Wikimedia's server setup: most pages are served ex-cache by one of the frontline squid servers, so we'd need to publish a number of the (huge) squid logs too. It would indeed be a very interesting exercise to run some analyses of the logs, as there are all kinds of things we don't know about how visitors and search engines enter, move around, and leave the site. If we could figure out an acceptable way to do things then there is much value to be mined. -- Finlay McWalter | Talk 00:58, 28 Aug 2004 (UTC)
Right, logs are private. You can find some statistics extracted from log files, including a list of referrers, on the Webalizer stats page (http://wikimedia.org/stats/en.wikipedia.org/). -- Tim Starling 01:28, Aug 28, 2004 (UTC)
Interesting. I hadn't thought about the privacy issues, but they make complete sense to me. If I get enough time to do the research properly, I'll contact y'all with a proposal to either A) write a log sanitizer that meets your standards, or B) write an analysis program that you guys run, so I never see the log data. Also, thanks for the link to the Webalizer pages; I had missed those the first time through. Thanks, --William Pietri 20:54, 30 Aug 2004 (UTC)
When you write your log file analysis program, keep in mind that the squid logs for each day come to about five gigabytes. Also, depending on how much spare CPU time there is, running the analyzer might be done only intermittently, as has been the case with the Webalizer statistics. It will be interesting to see the results of this. Until now, everybody wanting to do this has been all talk and no code. -- Jeronim 05:26, 1 Sep 2004 (UTC)
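The sanitiser that Finlay McWalter describes above (abstract the IP address, purge the referrer) might look roughly like the following C++ sketch. It is an illustration under stated assumptions, not an existing tool: it assumes Apache-style combined log lines, where the client address is the first field and the referrer is the second quoted field, whereas Squid's native log format is laid out differently. The function name <code>sanitizeLine</code> and the <code>visitorN</code> labels are hypothetical, and replacing addresses with counters does not by itself guarantee anonymity.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sanitise one combined-format log line (assumed format, see lead-in):
//   ip ident user [date] "request" status bytes "referrer" "user-agent"
// The IP is replaced by a per-run pseudonym and the referrer by "-".
// Lines that cannot be parsed are dropped (empty string) rather than
// passed through, so nothing private leaks by accident.
std::string sanitizeLine(const std::string& line,
                         std::unordered_map<std::string, std::size_t>& visitorIds) {
    const std::size_t firstSpace = line.find(' ');
    if (firstSpace == std::string::npos || firstSpace == 0) return "";

    // Map each distinct address to a small counter, so that one visitor's
    // requests can still be followed without revealing the address.
    const std::string ip = line.substr(0, firstSpace);
    const std::size_t nextId = visitorIds.size() + 1;
    const std::size_t id = visitorIds.emplace(ip, nextId).first->second;

    // Collect positions of unescaped double quotes.
    std::vector<std::size_t> quotes;
    for (std::size_t i = firstSpace; i < line.size(); ++i) {
        if (line[i] == '"' && line[i - 1] != '\\') quotes.push_back(i);
    }
    // Need at least the request and referrer fields: four quotes.
    if (quotes.size() < 4) return "";

    std::string out = "visitor" + std::to_string(id);
    // Everything from the first space up to and including the opening
    // quote of the referrer field is copied unchanged.
    out += line.substr(firstSpace, quotes[2] + 1 - firstSpace);
    out += "-";                      // purged referrer
    out += line.substr(quotes[3]);   // closing quote and the rest
    return out;
}
```

The <code>visitorIds</code> map has to be shared across all lines of one run so that the same address always receives the same label. At roughly five gigabytes of squid logs per day, as noted above, a line-at-a-time pass like this keeps only the address map in memory.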