Talk:Link transit: Difference between revisions

    From Consumerium development wiki R&D Wiki
    (how to)
    (including Wikipedia Village pump discussions regarding logs, between William Pietri (troll), Raul 654, Jeronim, and Tim Starling (sysop))
    Line 184: Line 184:
    }
    }
    </pre>
    </pre>
    ----------------
    ''from [http://en.wikipedia.org/w/wiki.phtml?title=Wikipedia:Village_pump&oldid=5582713#Downloading_log_files.3F Wikipedia Village pump discussions 28 August to 1 September 2004 ]''
    Does anybody know where I can download a few days worth of Wikipedia hit log files? I'd like to do some analysis on them, but I can't figure out where they're kept. Thanks, --William Pietri 00:30, 28 Aug 2004 (UTC)
    We don't have daily log files. We have log files for the current month, updated nightly. If you take the difference from one nightly log file to the next, you can calculate the day's hits. →Raul654 00:37, Aug 28, 2004 (UTC)
    Ah, I was looking for the raw log files, with Referer information. I was curious do some analysis on Wikipedia usage patterns, including how people come in via search engines and what they do after that point. But I gather you're saying that raw hit logs just aren't collected on a regular basis? In that case, are there older ones around, perhaps collected during an optimization run? --William Pietri 00:44, 28 Aug 2004 (UTC)
    The raw logs are available only to wikimedia's developers. Publishing them would be a major issue, as in general we view people's IPs, the sites and pages they visit, and what they do here to be private information. A lot of people would be very pissed off if the raw logs were available to the casual visitor. I suppose if there were such a thing as a decent unified logfile sanitiser (which removed/abstracted IP information) then publishing that would be an option. But if I were running such a sanitiser, I'd surely have it purge the referrer information too, as (occasionally) there's juicy info in the referring URL (particularly when the referrer is an email in a webmail service, and sometimes a search query). That said, there doesn't seem to be such a program anyway (I guess I don't know what it would be called, so it's rather hard to google for it). The problem of examining the log files is compounded by wikimedia's server setup - most pages are served ex-cache by one of the frontline squid servers; so we'd need to publish a number of the (huge) squid logs too. It would indeed be a very interesting exercise to run some analyses of the logs, as there's all kinds of things we don't know about how visitors and search engines enter, move around, and leave the site. If we could figure out an acceptable way to do things then there is much value to be mined. -- Finlay McWalter | Talk 00:58, 28 Aug 2004 (UTC)
    Right, logs are private. You can find some statistics extracted from log files, including a list of referrers, on the Webalizer stats page (http://wikimedia.org/stats/en.wikipedia.org/). -- Tim Starling 01:28, Aug 28, 2004 (UTC)
    Interesting. I hadn't thought about the privacy issues, but they make complete sense to me. If I get enough time to do the research properly, I'll contact y'all with a proposal to either A) write a log sanitizer that meets your standards, or B) write an analysis program that you guys run, so I never see the log data. Also, thanks for the link to the Webalizer pages; I had missed those the first time through. Thanks, --William Pietri 20:54, 30 Aug 2004 (UTC)
    When you write your log file analysis program, keep in mind that the squid logs for each day come to about five gigabytes. Also, depending on how much spare CPU time there is, running the analyzer might be done only intermittently, as has been the case with the Webalizer statistics. It will be interesting to see the results of this. Until now, everybody wanting to do this has been all talk and no code. -- Jeronim 05:26, 1 Sep 2004 (UTC)

    Revision as of 20:19, 5 June 2005

    I'm not sure if our apache is configured to even log internal clicks. Maybe, maybe not. Anyways I don't currently have any software to report or analyze this information. If you know of GPL or otherwise free tools for digging this information from httpd logs pls post here --Juxo 16:47, 1 Sep 2004 (EEST)

    This tends to be expensive software run by ad server companies. But it is certainly in use in all publicly traded search engines like Yahoo and Google, in fact, you can see the "imgurl" they use to track say which queries led to which image lookups.
    start with logging the HTTP referrer. Uncomment this line in apache's httpd.conf: #CustomLog logs/referer_log referer then rotate this file and publish it.

    I wrote up a basic program to perform this kind of analysis on log files, but I'm not sure why you think it would be useful for either contributors or Bomis. It's certainly not a commonly requested feature. Wouldn't view count data be more useful than link transit data?

    Both are useful for the same reasons. And both are not available. "Server load" is a lousy excuse, when Wikimedia could raise all the money it needed for hardware with an independent board.
    You say both are useful for the same reasons. What reasons are those? -- Tim Starling 07:13, 4 Sep 2004 (EEST)
    A serious encyclopedia or any journal would care which pages were reviewed, and which were reviewed from which others, and how often, and what connections were of interest to readers. It's kind of embarassing to have to say that out loud.
    Of course we care about review,
    Trolls doubt that very much.
    but wouldn't that be better served by popularity data than link transit data?
    Restoring the information about page popularity is also quite useful. But more useful is information about paths between, i.e. that the single most popular path was for instance Wikipedia -> GFDL corpus -> sysop vandalism -> Wikimedia -> tabloid journalism -> libel suit -> easy street would demonstrate that people actually had understood the subjects correctly and went to the next most logical page to learn the next most important thing and ended up in the most logical place where most reasonable people would want to go given that starting point, while a path from Zionism -> Robert Kaiser -> Nazipedia -> Anti-Anti-Anti-Anti-Zionism would perhaps indicate a quite confused person who was pretty much being subjected to a pile of propaganda and would come away with a quite different impression of the subject than what a serious editor would want.
    What would you do with link transit information? How do you "elaborate" a link? The best use of it I can think of is to pick a small set of related articles, and draw pretty graph pictures. A noble goal, to be sure, but it would require a change to the program below to generate such data efficiently. -- Tim Starling 06:30, 6 Sep 2004 (EEST)
    Pictures aren't really the point. What if one were able to simply see the number of times that a link was transited in the past month, as a superscripted number ("exponent") on that link? Then it would just fit in the regular page display. It would give some idea of the typical paths followed through pages.

    This matters because I need to know what the output format should be, and I need to have some way to justify using server resources to generate such data.

    A map of nodes/pages with the number of link transits on each edge, such edges representing a link, is the obvious display. But that would be huge so one must be able to filter down to a very small number of pages and links that hold them together, typically the most heavily clustered / deeply connected to each other. Xerox PARC did some research on this about twenty years ago.

    Anyway, following is the result of a couple of hours of procrastination. -- Tim Starling 11:26, 3 Sep 2004 (EEST)

    It looks like C to me and it looks like that the Main() takes standard httpd.log as input. I'll run this on our logs sometime when I have the time. Kinda busy now. --Juxo 13:02, 3 Sep 2004 (EEST)
    It's C++. It outputs two sections separated by a double linefeed. The first is an indexed list of URLs. The second has three values on each line: index from, index to and the transit count. The idea is that you would read all this into a relational database with an index on all three columns, then perform whatever analysis you need to perform. -- Tim Starling 07:13, 4 Sep 2004 (EEST)
    #include <string>
    #include <iostream>
    #include <vector>
    #include <map>
    using namespace std;
    
    #define LINE_BUF_SIZE 1000
    
    #define REPORTING_INTERVAL 10000
    
    int getUrlIndex(char* s);
    
    class char_order
    {
    public:
    	bool operator()(char* s1, char* s2) 
    	{
    		return strcmp(s1, s2) < 0;
    	}
    };
    
    typedef map<char*, int, char_order> char_map;
    
    typedef char_map::iterator hash_iterator;
    typedef vector<map<int, int> >::iterator vectormap_outer_iterator;
    typedef map<int, int>::iterator vectormap_inner_iterator;
    
    vector<map<int, int> > outbound;
    vector<char*> urls;
    char_map urlHash;
    
    int main(int argc, char** argv) {
    	FILE* file;
    	if (argc == 1) {
    		file = stdin;
    	} else if (argc == 2) {
    		file = fopen(argv[1], "r");
    		if (!file) {
    			printf("Can't open file %s\n", argv[1]);
    			return 1;
    		}
    	} else {
    		printf("Incorrect argument count\n");
    		return 1;
    	}
    
    	
    	char buffer[LINE_BUF_SIZE];
    	int numLines = 0;
    	
    	while (!feof(file)) {
    		numLines = (numLines+1)%REPORTING_INTERVAL;
    		if (numLines == 0) {
    			fprintf(stderr, ".");
    			fflush(stderr);
    		}
    		if (!fgets(buffer, LINE_BUF_SIZE-1, file)) {
    			break;
    		}
    		
    		// Find start of quoted method/URL string
    		char* method = strchr(buffer, '"');
    		if (!method) {
    			continue;
    		}
    		method++;
    
    		// Find end of method, and start of URL
    		char* url = strchr(method, ' ');
    		if (!url) {
    			continue;
    		}
    		*url = '\0';
    		url++;
    
    		// Find end of URL
    		char* referrer = strchr(url, ' ');
    		if (!url) {
    			continue;
    		}
    		*referrer = '\0';
    		referrer++;
    		
    		// If URL does not contain "wiki", skip
    		if (strstr(url,"/wiki/") == NULL) {
    			continue;
    		}
    		
    		// Find start of referrer
    		referrer = strstr(referrer, " \"");
    		if (!referrer) {
    			continue;
    		}
    		referrer += 2;
    
    		// Find end of referrer
    		char* end = strchr(referrer, '"');
    		if (!end) {
    			continue;
    		}
    		*end = '\0';
    		
    		// Obtain indexes
    		int from = getUrlIndex(referrer);
    		int to = getUrlIndex(url);
    
    		// Add to matrix
    		if (outbound.size() < from+1) {
    			outbound.resize(from+1);
    		}
    		outbound[from][to]++;
    	}
    
    	// Output URLs
    	int numUrls = urls.size();
    	for (int i=0; i<numUrls; i++) {
    		printf("%d\t%s\n", i, urls[i]);
    		delete[] urls[i];
    	}
    	printf("\n");
    	for (int i=0; i<outbound.size(); i++) {
    		map<int,int> & row = outbound[i]; 
    		for (vectormap_inner_iterator j=row.begin(); j!=row.end(); j++) {
    			printf("%d\t%d\t%d\n", i, j->first, j->second);
    		}
    	}
    	return 0;
    }
    
    
    int getUrlIndex(char* s)
    {
    	int index;
    	hash_iterator iter = urlHash.find(s);
    	if (iter != urlHash.end()) {
    		index = iter->second;
    	} else {
    		// Copy string to the heap
    		int length = strlen(s)+1;
    		char* newMem = new char[length];
    		memcpy(newMem, s, length);
    
    		// Add to the containers
    		urls.push_back(newMem);
    		index = urls.size() - 1;
    		urlHash[newMem] = index;
    	}
    	return index;
    }
    

    from Wikipedia Village pump discussions 28 August to 1 September 2004

    Does anybody know where I can download a few days worth of Wikipedia hit log files? I'd like to do some analysis on them, but I can't figure out where they're kept. Thanks, --William Pietri 00:30, 28 Aug 2004 (UTC)

    We don't have daily log files. We have log files for the current month, updated nightly. If you take the difference from one nightly log file to the next, you can calculate the day's hits. →Raul654 00:37, Aug 28, 2004 (UTC)

    Ah, I was looking for the raw log files, with Referer information. I was curious do some analysis on Wikipedia usage patterns, including how people come in via search engines and what they do after that point. But I gather you're saying that raw hit logs just aren't collected on a regular basis? In that case, are there older ones around, perhaps collected during an optimization run? --William Pietri 00:44, 28 Aug 2004 (UTC)

    The raw logs are available only to wikimedia's developers. Publishing them would be a major issue, as in general we view people's IPs, the sites and pages they visit, and what they do here to be private information. A lot of people would be very pissed off if the raw logs were available to the casual visitor. I suppose if there were such a thing as a decent unified logfile sanitiser (which removed/abstracted IP information) then publishing that would be an option. But if I were running such a sanitiser, I'd surely have it purge the referrer information too, as (occasionally) there's juicy info in the referring URL (particularly when the referrer is an email in a webmail service, and sometimes a search query). That said, there doesn't seem to be such a program anyway (I guess I don't know what it would be called, so it's rather hard to google for it). The problem of examining the log files is compounded by wikimedia's server setup - most pages are served ex-cache by one of the frontline squid servers; so we'd need to publish a number of the (huge) squid logs too. It would indeed be a very interesting exercise to run some analyses of the logs, as there's all kinds of things we don't know about how visitors and search engines enter, move around, and leave the site. If we could figure out an acceptable way to do things then there is much value to be mined. -- Finlay McWalter | Talk 00:58, 28 Aug 2004 (UTC) Right, logs are private. You can find some statistics extracted from log files, including a list of referrers, on the Webalizer stats page (http://wikimedia.org/stats/en.wikipedia.org/). -- Tim Starling 01:28, Aug 28, 2004 (UTC)

    Interesting. I hadn't thought about the privacy issues, but they make complete sense to me. If I get enough time to do the research properly, I'll contact y'all with a proposal to either A) write a log sanitizer that meets your standards, or B) write an analysis program that you guys run, so I never see the log data. Also, thanks for the link to the Webalizer pages; I had missed those the first time through. Thanks, --William Pietri 20:54, 30 Aug 2004 (UTC) When you write your log file analysis program, keep in mind that the squid logs for each day come to about five gigabytes. Also, depending on how much spare CPU time there is, running the analyzer might be done only intermittently, as has been the case with the Webalizer statistics. It will be interesting to see the results of this. Until now, everybody wanting to do this has been all talk and no code. -- Jeronim 05:26, 1 Sep 2004 (UTC)