
An open-source, self-hostable solution providing 80% of common Google Analytics functionality seems doable to me.

Is there anything out there in this realm? If not, why not?



Have a look at piwik: http://piwik.org/



Unbelievable. Unsalted MD5, no less. There's an issue to fix this that's been open for seven years! https://github.com/piwik/piwik/issues/5728


Eh. The analytics data is pretty low value as far as hacker targets, and this can be mostly mitigated anyways by sane segregation of the admin backend from the publicly accessible site.

There's an open ticket for it, but it looks like it hasn't been addressed in a while since they don't want to break all existing passwords.

https://github.com/piwik/piwik/issues/5728



A low value target maybe, but having a critical security ticket open for seven years is unforgivable. If they don't want to break compatibility it's pretty simple: use something like PHPass and upgrade the hash when the user next logs in. i.e. what every halfway sensible web app did at least five years ago.
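A rough sketch of that upgrade-on-login idea, in Python rather than PHP/PHPass so it stays stdlib-only. The record layout (`scheme`, `salt`, `hash`), the function names, and the iteration count are all made up for illustration; this is not Piwik's schema or code.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, tune for your hardware


def legacy_md5(password: str) -> str:
    # The old scheme under discussion: unsalted MD5, hex encoded.
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def strong_hash(password: str, salt: bytes) -> str:
    # Salted, iterated replacement (PBKDF2-HMAC-SHA256 from the stdlib).
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return digest.hex()


def login(record: dict, password: str) -> bool:
    """Verify a login and transparently upgrade legacy records."""
    if record["scheme"] == "md5":
        if not hmac.compare_digest(legacy_md5(password), record["hash"]):
            return False
        # The plaintext is only available right now, so rehash it here.
        salt = os.urandom(16)
        record.update(scheme="pbkdf2", salt=salt.hex(), hash=strong_hash(password, salt))
        return True
    salt = bytes.fromhex(record["salt"])
    return hmac.compare_digest(strong_hash(password, salt), record["hash"])
```

Nobody's password breaks: a legacy record keeps working until its owner logs in, at which point it is silently replaced. Accounts that never log in again stay on MD5, which is the gap the "envelope" approach covers.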


It does not have to break all existing passwords. Just add an envelope for the old passwords.
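One way to read "envelope", sketched in Python with stdlib calls only: feed the stored MD5 hex digest into a salted, slow hash, so every row can be migrated in one batch without knowing any password. The function names, record layout, and iteration count are assumptions for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor


def wrap_legacy_hash(md5_hex: str) -> dict:
    """Migrate one stored MD5 digest without needing the plaintext."""
    salt = os.urandom(16)
    outer = hashlib.pbkdf2_hmac("sha256", md5_hex.encode("ascii"), salt, ITERATIONS)
    return {"salt": salt.hex(), "hash": outer.hex()}


def verify(record: dict, password: str) -> bool:
    # Recompute the inner MD5 from the submitted password, then the outer hash.
    inner = hashlib.md5(password.encode("utf-8")).hexdigest()
    salt = bytes.fromhex(record["salt"])
    outer = hashlib.pbkdf2_hmac("sha256", inner.encode("ascii"), salt, ITERATIONS)
    return hmac.compare_digest(outer.hex(), record["hash"])
```

After the batch migration the bare MD5 digests can be deleted, and existing passwords keep working.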


There's a $555 bounty if you can demonstrate a security vulnerability in Piwik because of that.


I'm not interested in further dehumanizing myself with participation in a bug bounty program.

I'll write an exploit for it (the general case, not just Piwik in particular) and drop it on OSS Sec some day, but here's a theoretical attack:

1. Guess a username somehow. Maybe "admin"? Whatever, we're interested in the security of the hash function. Let's assume we have the username for our target.

2. Calculate a bunch of guess passwords, such that we have one hash output for each possible value for the first N hexits.

e.g.

    substr(md5($string), 0, 2) === "00"
    substr(md5($string), 0, 2) === "01"
    substr(md5($string), 0, 2) === "02"
    // ...
    substr(md5($string), 0, 2) === "ff"

3. Send these guess passwords repeatedly and use timing information to get an educated guess at the first N hexits of the valid MD5 hash.

4. Iterate steps 2 and 3 until you have the first N bytes of the MD5 hash for the password.

5. Use offline methods to generate password guesses against a partial hash.

The end result: A timing attack that consequently allows an optimized offline guess. So even if their entire codebase is immune to SQL injection, you can still launch a semi-blind cracking attempt against them.
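The offline parts (steps 2 and 5) can be sketched in a few lines of Python. This only shows the bookkeeping: it builds one candidate string per 2-hexit MD5 prefix and filters guesses against a known prefix. It does no timing measurement and says nothing about whether the leak is measurable over a network. The candidate format (`guess0`, `guess1`, ...) and the function names are made up.

```python
import hashlib
from itertools import count


def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def guesses_by_prefix(n_hexits: int = 2) -> dict:
    """Step 2: find one input string for every possible leading n_hexits."""
    wanted = 16 ** n_hexits
    found = {}
    for i in count():
        candidate = f"guess{i}"
        prefix = md5_hex(candidate)[:n_hexits]
        found.setdefault(prefix, candidate)  # keep the first hit per prefix
        if len(found) == wanted:
            return found


def matches_partial(candidate: str, known_prefix: str) -> bool:
    """Step 5: keep only password guesses whose MD5 starts with the leaked prefix."""
    return md5_hex(candidate).startswith(known_prefix)
```

With two hexits, the table fills after a couple of thousand candidates on average. Each additional recovered hexit cuts the offline candidate pool by roughly a factor of 16.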


By the way, if anyone else wants to try to claim the $555 from Piwik based on the above theoretical attack, feel free.


How to protect from timing attacks - It's All About Time: http://blog.ircmaxell.com/2014/11/its-all-about-time.html


password_verify() compares hashes in constant-time, so, yeah...
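For anyone outside PHP, the same idea in Python: an early-exit comparison next to the stdlib's `hmac.compare_digest`. The `naive_equal` helper is written out only to show the pattern that leaks; it is not taken from any particular codebase.

```python
import hashlib
import hmac


def naive_equal(a: str, b: str) -> bool:
    # Returns at the first mismatching character, so the running time
    # depends on how long the matching prefix is.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


def constant_time_equal(a: str, b: str) -> bool:
    # compare_digest is designed so that timing does not reveal where
    # the inputs differ (both arguments must be ASCII str or bytes).
    return hmac.compare_digest(a, b)


stored = hashlib.md5(b"correct horse").hexdigest()
submitted = hashlib.md5(b"wrong guess").hexdigest()
print(naive_equal(stored, submitted), constant_time_equal(stored, submitted))
```

Both functions return the same answers; only the timing behaviour differs.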


http://snowplowanalytics.com/ is worth considering if you have larger volumes of traffic


Snowplow is great. Super scalable and they have instructions on how to host everything on your own AWS infrastructure.


+1 for Snowplow. We've been using it for more than a year now with high-traffic sites and it's working great.


What do you use as a front-end?


Why not for low volume sites?


Well, it takes a lot of time to set up, and it needs a few extra servers: an event collector, log cleanup and enrichment, data loading, and a database server.


Yes, there is: www.piwik.org


Self hosted analytics is still centralizing behavioral data on the users. It's not really any better from a privacy standpoint than GA.


It's nowhere near as centralized as Google Analytics though. At least if you're self-hosting, that data is confined to the silo which is your own analytics, rather than Google being able to aggregate it with users' behaviour on every other site they visit as well.


That silo is still aggregating data. Trying to argue it's "less" centralized by quantifying the amount of centralization is still akin to dissonance. Clearly people here don't agree with this, but that's to be expected when the topic is so polarizing. Traffic analytics must be important, so we rationalize our actions, or inactions, around how we collect them.

Any centralized solution, at any scale, can possibly violate someone's privacy. Period. If we want to really fix things, we should stop circle jerking ourselves and do something about it.


Not at all. The entire point is that Google is able to track one person across many, many sites. That is simply not possible if each site had its own self-hosted analytics.



