Just thoughts

Friday, May 21, 2010

Validating a URL in Java

Short and to the point. You can do it in two ways: hard and simple. Hard is when you come up with a 600-byte regex pattern that fails the moment a user inputs something you haven't thought of; simple is when you test whether the URL is well-formed from java.net's point of view. If it's not, you can't use it anyway.

java.net throws an exception if the URL isn't valid, and nothing says you have to do anything with the thrown message; catching it is enough.
//...
import java.net.MalformedURLException;
import java.net.URL;
//...

// Returns true if java.net can parse the string as a URL.
private boolean isValidURL(String url) {
    try {
        new URL(url);
        return true;
    } catch (MalformedURLException e) {
        return false;
    }
}

The usage is... very simple:

if (isValidURL(userInput)) {
    // Do something
} else {
    // URL is invalid
}

I Googled quite a lot for a better way that's as simpleminded as me, but no luck. So this method is actually as simple as my mind...
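By the way, new URL() is fairly permissive; it mostly checks that it recognizes the protocol, so something like http://my host/ still passes. If you want it a notch stricter, a common trick (my addition, not part of the original snippet) is to chain toURI(), which re-parses the string against RFC 2396:

//...
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
//...

// Stricter variant: toURI() rejects syntactically bad URLs
// (spaces in the host, illegal characters, and so on).
private boolean isStrictlyValidURL(String url) {
    try {
        new URL(url).toURI();
        return true;
    } catch (MalformedURLException e) {
        return false;
    } catch (URISyntaxException e) {
        return false;
    }
}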

Regex matching code comments - Java

Okay. So I hate regex for some reason, yet I have to use it cos... cos I have to. I spent my whole day trying to figure out a regex pattern that matches any code comment. I want to strip comments (i.e. replaceAll(pattern, "")) from CSS files cos they're useless; they just take up unnecessary space and bandwidth.

In English. I want this:


/*This is my comment
spanning across multiple lines*/
body {background-color: #000}

to become this:


body {background-color: #000}

Easy, isn't it? It turned out not to be that easy. I came up with this pattern first; I think it was the child of my own brain, but after all the hours spent trying to find a working pattern, I really don't remember:

(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)

Humm. Looks OK, doesn't it? It's not OK. Here's a multiline comment:


/* This is my comment
*  spanning across multiple lines
*  and having asterisks at every new line
*  cos that's cool
*/

If I use (?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*) as presented below on the above comment block, I get an awesome stack overflow error (read: I get a nice HTTP 500 on Ant):


// Strips comments from the stylesheet with the pattern above;
// the pattern itself is what blows up on longer comments.
public static String compress(String s) {
    s = s.replaceAll("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)", "");
    return s;
}

Guess it's not really an infinite loop; more likely catastrophic backtracking, where the ([^*]|(?:\\*+[^*/]))* part can match all those asterisks in too many different ways and the regex engine's recursion blows the stack.

So I need a better pattern or I'll keep getting SO errors every once in a while. I'm not sure where from, but I got this pattern:

//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/

This one works surprisingly well for now. I guess it also has limitations I'm not aware of yet, but at least I don't get SO errors.

Update:
Okay, so it strips relative URLs as well. Awesome :/
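A probable fix for CSS (my guess, untested against the original files): CSS has no // line comments in the first place, so the //.* branch, which eats everything from any // in a URL to the end of the line, can simply be dropped:

// Keep only the lazy /* ... */ branch from the pattern above;
// (?s) lets . match newlines, .*? stops at the first closing */.
public static String compressCss(String s) {
    return s.replaceAll("(?s)/\\*.*?\\*/", "");
}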


Wednesday, May 19, 2010

Google Appengine - free service, not good for anything

At least not for me. I wasn't much in the mood to try it out, but since everyone was so "wow" about it, why not.
But what should I do, what should the first project be?! Let's create a chat!

GAE Chat
What can be so hard about creating a chat on Appengine using Python? Ajaxy GUI, Google single sign-on, free, fast database... it should rock, shouldn't it? I tell you what: it won't.
If you're not clever enough, you hit a limit even with two users chatting. Not Datastore calls, not the number of requests... but CPU time! You have the fastest servers you can imagine, and what happens when you run an AJAX chat on them? You hit the wall, cos that's how generous the quota is.
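The usual workaround (my assumption; I never went back to verify it) is to serve the AJAX polls from Memcache, so the Datastore and most of the CPU time are only spent when somebody actually posts. A sketch in Java, even though the chat itself was Python:

import java.util.ArrayList;
import java.util.List;

import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Poll requests read the recent messages straight from Memcache;
// a Datastore write (not shown) only happens on an actual post.
public class ChatCache {
    private static final String KEY = "recent-messages";
    private static final MemcacheService cache =
            MemcacheServiceFactory.getMemcacheService();

    @SuppressWarnings("unchecked")
    public static List<String> recentMessages() {
        Object cached = cache.get(KEY);
        return cached != null ? (List<String>) cached : new ArrayList<String>();
    }

    public static void addMessage(String msg) {
        List<String> messages = recentMessages();
        messages.add(msg);
        cache.put(KEY, messages); // plus the durable Datastore write
    }
}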

GAE Audio/Voice Chat
Nor is this hard, considering that Java has a massive, powerful Sound API. In fact, you can create such a service with no more than 50 lines of Java and two classes.
Hoho... hold it right there; it's not so easy if you want to use GAE! You didn't notice that GAE has no support for the Sound API, did ya? Creating an audio chat on Appengine = failure.
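Just to show what I mean by 50 lines, here's roughly what the capture side looks like with the Sound API (a bare-bones sketch; the format values are illustrative), and it's exactly the kind of code GAE won't run:

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

// Bare-bones microphone capture: 8 kHz, 16-bit, mono, i.e. telephone
// quality, which is plenty for voice chat.
public class MicCapture {
    public static void main(String[] args) throws LineUnavailableException {
        AudioFormat format = new AudioFormat(8000f, 16, 1, true, false);
        TargetDataLine mic = AudioSystem.getTargetDataLine(format);
        mic.open(format);
        mic.start();
        byte[] buffer = new byte[4096];
        while (true) {
            int read = mic.read(buffer, 0, buffer.length);
            // ship 'read' bytes off to the other chat party here
        }
    }
}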

GAE Crawler
How about automated crawling of a forum? I want stats, so I have to extract data from URLs. In PHP that's 100 lines of code at most, database CREATE calls included, so with the powerful Python or Java it should be a piece of cake. Especially since you have such a simple GQL syntax.
Nah, you won't automate your crawler... you hit a frigging limit again! It turns out that Google almighty doesn't like forums having so much content... there are too many URIs to be crawled. So what, just damn crawl it.

GAE URLFetch


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.List;
import java.util.Map;
import javax.servlet.http.*;

@SuppressWarnings("serial")
public class html_fetcherServlet extends HttpServlet {

    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/html");
        try {
            if(req.getParameter("uri") != null){
                URL url = new URL(req.getParameter("uri"));
                HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                // Plain GET with no request body, so setDoOutput(true) isn't needed.
                connection.setRequestMethod("GET");
                connection.setRequestProperty("User-Agent", "My UA");

                 String urlenc = URLEncoder.encode(req.getParameter("uri"), "UTF-8");
                 connection.setRequestProperty("Referer", "http://www.google.com/url?sa=t&source=web&ct=res&cd=7&url=" + urlenc + "&ei=0SjdSa-1N5O8M_qW8dQN&rct=j&q=flowers&usg=AFQjCNHJXSUh7Vw7oubPaO3tZOzz-F-u_w&sig2=X8uCFh6IoPtnwmvGMULQfw");

                connection.setInstanceFollowRedirects(false);

                BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
                // Dump the response headers, using generics instead of raw-type casts.
                Map<String, List<String>> responseMap = connection.getHeaderFields();
                int i = 1;

                for (Map.Entry<String, List<String>> header : responseMap.entrySet()) {
                    resp.getWriter().println(header.getKey() + " = ");
                    for (String value : header.getValue()) {
                        resp.getWriter().println(value);
                    }
                }
                resp.getWriter().println("
");
                String line; // local variable, not an instance field: servlets must be thread-safe
                while ((line = reader.readLine()) != null) {
                    resp.getWriter().println(i + " " + escapeHtmlFull(line));
                    i++;
                }
                resp.getWriter().println("
");
                reader.close();
            }
        } catch (MalformedURLException e) {
            resp.getWriter().println(e.getMessage());
        } catch (IOException e) {
            resp.getWriter().println(e.getMessage());
        }
    }


    // Minimal HTML escaping (a stand-in; the original body wasn't shown)
    // so the fetched markup displays as text instead of rendering.
    public static StringBuilder escapeHtmlFull(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int j = 0; j < s.length(); j++) {
            char c = s.charAt(j);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb;
    }
}


Simple, isn't it? Set some headers, Referer and User-Agent, then get the remote page... hard? No.
Google says... hold on, Sparky! You won't do that, will ya?! No, I won't, cos the almighty doesn't let me. Better said, it does, but it frigging appends its own User-Agent to mine, just to show off how cool they are! I tell you what: I need my own User-Agent cos it's my frigging app, not yours! The app has to send exactly the User-Agent I set, else there's no point in creating it. And when do I find out that their crap gets appended to my crap? On the production server; locally it works fine.
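For the record, what actually goes out from production looks roughly like this (the appid is illustrative):

User-Agent: My UA AppEngine-Google; (+http://code.google.com/appengine; appid: yourappid)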

Did I finish whining? Nope.

GAE Sitemap checker
How about uploading some XSDs as static files with my app and letting webmasters test their sitemaps against those XSDs? It's a cool app, and it's for humanity.
Not so fast (again)! You can't open your own files for READING through Java on Appengine (probably not with Python either). Your own files...
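The checker itself would've been a handful of lines with javax.xml.validation; a sketch, assuming the schema ships with the app as sitemap.xsd (the file name is made up):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Validate an uploaded sitemap against a bundled XSD; this is the part
// that dies on GAE, because the schema file can't be opened for reading.
public class SitemapChecker {
    public static boolean isValid(File sitemap) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new File("sitemap.xsd"));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(sitemap));
            return true;
        } catch (Exception e) { // SAXException on invalid XML, IOException on read errors
            return false;
        }
    }
}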


So, why don't I pay for this service and get some limits lifted?
After all these experiences? Hell no. I was told it's free. If I ever pay for a cloud computing service, it will be EC2, just because they never let me fall into false assumptions like getting a top-notch service for free...

Monday, May 17, 2010

A great interview... FAIL

So, I was lurking at my "awesome" Twitter feed and saw a tweet by JohnMu (http://twitter.com/JohnMu/status/14165538484). He points to an interview by Lee Odden on TopRankBlog.com. The subject is Maile Ohye, one of my favorite Googlers to date (she points out great things and she's even funny); the problem is not with Maile, she did a great job answering the questions. The problem is with the questions.

When you do an interview, please, for humanity's sake, don't do it with template questions. The interviewer had a bunch of unrelated questions drawn up back at the office, written down on paper, and he simply asked those template questions. This is bad. It's unnatural.

A good interview should be a dialog between the interviewer and the subject of the interview. A chat. You may have some questions written down on paper, but you should just flow with the subject and fall back on the pre-made questions only when the chat hits a wall.

For example, Lee asks:

Google Webmaster Central has been a great resource for many webmasters. What tips can you share with web site owners to make the most out of Google Webmaster Tools?
Maile answers:
Awww, Webmaster Central a “great resource” for many webmasters? That’s wonderful to hear. As for tips, I’d say verify ownership of your site in Webmaster Tools, sign up for email forwarding in Webmaster Tools’ Message Center, and then check out all the specific data for your site: our Top Search Queries feature was just revamped. Crawl Errors is cool for making sure your site is accessed as you’d expect (many people find unknown 404s, or realize they have server downtime because of noticing the “Unreachable” errors), HTML Suggestions shows you the URLs with duplicate titles or meta descriptions. I think once you start poking around in Webmaster Tools you’ll learn more and more. It’s addictive.

then Lee asks Maile a totally unrelated question:
How does one become a Bionic Poster?
Come on, considering Maile's answer to the previous question, would you ask that if you were chatting with her over a coffee in Starbucks? You certainly wouldn't. You would ask something like "What if the webmaster hits a wall and needs help?", and she'd reply that there's a very helpful forum where they can ask for help and the Bionics would likely help the webmaster. Then you could ask how one becomes a Bionic.

Just saying...

I hate HBO

It's not because it can't deliver top-notch movies; it certainly can. It looks good, it offers HD (well, sort of), full stereo, and sometimes surround sound that drives my neighbors mad... but still, it's a messed-up channel.

It happens that I love The Lord of the Rings (LotR), I really do. It's an awesome movie from many points of view, but every installment is frigging long; you're expected to sit and watch the whole movie in one shot, and I tell you, it's a pain! A movie is good when it delivers some action every seven minutes; that's required to keep people from falling asleep. BUT! There are movies (LotR) which take this to the extreme and show really neat stuff between those seven-minute marks too, or the "wake the hell up" action takes exactly 6'59" and then continues with another "wake the hell up" action. This is bad! No, it's Evil! Because:

  1. you have to go to the bathroom. Shit happens sometimes, and this time it happens to happen exactly in the middle of a "wake the hell up" action. What do you do? Go and miss the action, or hold it in, trading the pleasure for pain?
  2. you're hungry; quite common nowadays, in the middle (or end, heck knows) of the financial crisis. Do you go eat something, taking into consideration that the fridge is downstairs and you have no idea whether there's anything in it, or do you keep watching the movie and starve, causing yourself pain?
  3. let's say you're a smoker and your girlfriend(s) won't allow you to smoke in the room. What do you do? Miss the action, or go out on the balcony and enjoy it, knowing you'll die sooner?
  4. you forgot to turn down your PC speakers and a weird sound notifies you that you've got a new email. Of course, the PC is downstairs, and you're expecting an email from the CEO of MSFT, as both of you love puppies and you've struck up a cool conversation with him about these little, weird animals (due to a lack of other, meaningful subjects).
Now, every time I watch something on HBO, these things happen to me... always! How about HBO implementing something like a "Pause" function? It won't happen anytime soon, will it? I hate HBO...

Sunday, May 2, 2010

Shalom, fh6whUq3NnsPfj8g3vr0gQO4Yyzf.com

So I've got a wiki. Not a big deal, but still good for people to share information. A nice thing about wikis is that people can share information without actually signing in; their IP address is logged, but otherwise they can do whatever they want. The bad thing about wikis is that people can do whatever they want without signing in. Combine the good and the bad and you get a contradiction.

Wikis are heaven for spammers and vandals; this is a known fact. Spam is a weird thing, but when the spammer links to a nonexistent domain, that's even weirder. This is what happened on the aforementioned wiki. After I checked with a mysterious John, he had the idea to register the domain name and watch what happens: how many visitors the domain gets, how much AdSense revenue such a domain can earn, and so on. Below you will find all the details.

The spammer

On the wiki, the spammer's IP address at the time the spammy content was placed was 114.80.67.252. This IP address was traced back to China and seems to be used by the Shanghai Minhang Cancer Hospital.
The content of the spam was similar to:
yjfgv http://fh6whUq3NnsPfj8g3vr0gQO4Yyzf.com
The first part of the comment is most likely a token unique to the site the comment was placed on, used by the spammer to track which websites let comments through without moderation.

When we first observed the link, Google knew about roughly 960-970 web pages where the string appeared. At the time of this post, Google knows about 49,000.

Traffic drill down

In the course of one week, the website received traffic mainly (obviously) from referring websites. In total, there were 6,120 unique requests. In 289 cases the HTTP response code was "304 (Not Modified)", which means someone on that specific computer and browser had already visited the website.
In 164 cases the user was referred from a WordPress back-end's Akismet spam viewer, which means some blog admins have the habit of visiting URLs from already-caught spam.
Of the total 6,120 unique requests, 613 were referred by Live Mail and Yahoo! Mail. Since mailboxes weren't targeted by the spammer, this likely means people received a notification from the publishing platform and clicked through from within the spammy mail. Noteworthy: Gmail referred only a single visitor.

Unique requests per day:
  • 2010-04-11, 18:36 - 0 (website went online)
  • 2010-04-12, 18:35 - 4515 
  • 2010-04-13, 18:37 - 798
  • 2010-04-14, 18:38 - 318
  • 2010-04-15, 18:16 - 185
  • 2010-04-16, 18:31 - 168
  • 2010-04-17, 18:31 - 136


User-Agent strings
  • iPhone     - 142
  • Chrome     - 330
  • MSIE       - 2104 (IE6: 677, IE7: 519, IE8: 899)
  • FireFox    - 1053
  • Opera      - 90
  • Blackberry - 33
  • Safari     - 302
  • Android    - 11


Crawler Visits

Yandex's crawler visited the homepage for the first time on the 11th of April at 23:50 GMT. It was followed by Googlebot on the 12th at 00:45 GMT, then Yahoo's Slurp the same day at 06:25 GMT. The last was Archive.org's crawler, which visited the homepage on the 16th of April at 13:57 GMT.
During the one-week test period, none of Ask's or Bing's crawlers visited the homepage.

AdSense Stats

In one week, the total earnings through AdSense were about 2 USD. The click-through rates (CTR) were relatively normal for a page with no other links to click.
Below is the AdSense drill-down containing the date, number of clicks, and CTR.
  • 2010-04-11   0   0  (website went online)
  • 2010-04-12   8   2.29%
  • 2010-04-13   7   1.87%
  • 2010-04-14   3   1.66%
  • 2010-04-15   5   4.03%
  • 2010-04-16   1   0.8%
  • 2010-04-17   0   0%

Publishing Platforms

In the very first days of the domain, we compiled a list of the publishing platforms on which the spammy comment appeared. 69 percent of the comments were left on custom-coded guest books or blogs. The rest of the spam was left on some sort of well-established publishing platform:
  • WordPress   - 15%
  • Joomla      - 3%
  • Mediawiki   - 2%
  • Serendipity - 2%
  • phpBB       - 2%
  • Other       - 7%
In 13% of the comments left by the spammer, the URL was automatically turned into a link by the CMS. In 23% of the cases where a link appeared, the "nofollow" microformat was not used.