Regex

testing935
7 years ago

0

I need help, guys!
My program must find “?page=hjklkjjmnj”,
but it can’t print the result.

Here is my code:


        test = self.LFItest  
        adresseHTTP = self.site  

        socker = requests.get(adresseHTTP)  
        parser = URLLISTER()  
        parser.feed(socker.text)  

        TabLink = []  
        TabLinkValid = []  

        x = 0  
        print color.green + "\n[*] Load link of the site...\n"  

        print color.red + "\n\t\t LINKS \n"  
        for url in parser.urls :  
            print "["+ str(x) + "] " + url  
            TabLink.append(url.encode("utf-8"))  
            x += 1  

        print color.blue + "\nClean intern link of site..."  
        string = " ".join(TabLink)  

        print string  

        re1 = '.*?'  
        re2 = '(\\\'.*?\\\')'  
        rg = re.compile(re1+re2, re.IGNORECASE|re.DOTALL )  

        m = rg.search(string)  

        if m:  
            result = m.group()  
            print "\n\n(" + result + ")" + "\n"  

Normally it should print the variable “result”,
but nothing happens.

Thanks in advance for your answers.

L00PeR
7 years ago

0

To use requests, you must first create a session:
s = requests.Session()
Posting the error that Python returns may also be helpful.

dloser
7 years ago

1

@L00PeR: No, it is not required to use a session. And I suspect there is no error, because the problem is that the condition of the final if (m) is None, so the print is never reached.

@testing935: Some more tips on asking questions: Do not include all this irrelevant code. As far as I can tell your question is only about the regular expression. So what is the point of the code? It only distracts from the question (as demonstrated by @L00PeR).

Also, be clear about what you want. You say you want to find “?page=hjklkjjmnj”. Do you mean exactly that string (so not “?page=hjklkjjknj”) or do you mean the query part of a URL (everything after ?) or only the first part of the query if it starts with ‘page=’ or …

Finally, try out regexes by playing around with one of those online regex tools that show you exactly how your regex works and what matches what.

Looking at your regex, I have seriously no idea how it relates to the string you want to match. Why are you matching (single) quotes? Why all the unnecessary escaping and operators?
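To make the point above concrete, here is a minimal sketch in Python 3 (the code in the question is Python 2). The `links` string is a made-up stand-in for the joined link list, and the `query` pattern is only one possible reading of what the question is after (the query part of a URL); neither comes from the original post.

```python
import re

# Hypothetical sample standing in for " ".join(TabLink) from the question.
links = "index.php ?page=hjklkjjmnj http://example.com/about"

# The two pieces from the question. Together they form the regex .*?(\'.*?\'),
# which only matches text wrapped in single quotes.
re1 = '.*?'
re2 = '(\\\'.*?\\\')'
quoted = re.compile(re1 + re2, re.IGNORECASE | re.DOTALL)
print(quoted.search(links))  # None: the input contains no single quotes

# One possible pattern for the query part: '?', a name, '=', then a value
# that runs until whitespace, '&' or '#'.
query = re.compile(r"\?\w+=[^\s&#]*")
print(query.findall(links))
```

Since `search` returns None here, the `if m:` branch in the question is skipped, which would explain why nothing is printed.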

Punkachu
7 years ago

0

I agree with dloser: your question lacks information and details.
Especially when you are working with regular expressions, keep in mind that websites often keep the same overall structure in their source code (following the site’s template).

You need to take a close look at that structure to make sure you extract the right data. Also keep in mind that websites remain dynamic, so you may want to add a check function to make sure the site has not changed (very important in web crawling).

But I will give you a useful tip: use the following function to get all links from any website:

    def get_links(html):
        """Return a list of links from html."""
        # a regular expression to extract all links from the webpage
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
        # list of all links from the webpage
        return webpage_regex.findall(html)

findall returns a list of strings, of course.
You already have the right import; if you want more help, just give more details.

SIGKILL [r4v463]
7 years ago
Punkachu
7 years ago

0

Actually, r4v463, parsing HTML with regexps is 10 times faster than using other modules such as Scrapy.
Automata-based algorithms are much better than heavy plugins or modules, but they are harder to handle.
So I think it is worth it.

SIGKILL [r4v463]
7 years ago | edited 7 years ago

0

If it’s 10 times faster but doesn’t work, it is hardly worth anything. If you write your own HTML following a clear syntax and without mistakes, then you can parse it with regex. But if you have a complex and non-standard structure (which HTML is by nature, because its syntax is very permissive), then your regex will fail. For example, what if there is a ‘ or a “ in your URL? Your regex will fail because it will capture only the part of the URL before the first ’ or ”.
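The failure case described above can be reproduced with a short sketch. It is written for Python 3, the sample `html` string is invented for illustration, and the `LinkCollector` class is a hypothetical helper built on the standard library's `html.parser`; it is one alternative among several, not something proposed in the thread.

```python
import re
from html.parser import HTMLParser

# Made-up sample: a double-quoted href whose value contains apostrophes.
html = '<a href="/search?q=rock\'n\'roll">music</a>'

# The link regex posted earlier in the thread.
link_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
regex_links = link_regex.findall(html)
print(regex_links)  # the capture stops at the first apostrophe


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html)
print(collector.links)  # the parser keeps the whole attribute value
```

The regex could be tightened to require the same quote character on both sides, but each such fix only covers one more variation of the markup.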

Punkachu
7 years ago

0

That’s why I said it is hard to handle well, but if you really know how to use it, your example won’t be that difficult to handle.
Regular expressions are amazingly powerful.
I always use them to scrape data from the internet and I never fail.

testing935
7 years ago | edited 7 years ago

0

Hi again,
I will explain my problem more clearly.
I want my program to parse an HTML source so that it can output the essential links, such as “?Page=poney” or “?File=yolo”. So I have a function that saves all the links (internal or external to the site, whatever) and then keeps only the links that correspond to what I’m looking for, that is to say strings like “?Page=edkusuhf”, “?File=iushdfciosi” or “?ksuhf=skegdkeh”.

Here is the problem: I use a function that finds all the links of a page and saves them in a list; then, with the regex, I read the links to match the type of link that I am looking for... except that when it should print the matching links, it does not display anything.

The online tool I use is:
http://txt2re.com/index-csharp.php3
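If the goal is as described above (keep only the links that carry a query such as ?Page=... or ?File=...), the filtering step can be sketched without a regex. This is a minimal Python 3 sketch under that assumption; the `links` list and the `links_with_query` function are hypothetical names invented for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Made-up sample standing in for the collected link list.
links = [
    "index.php?Page=poney",
    "download.php?File=yolo",
    "http://example.com/about.html",
    "#top",
]


def links_with_query(urls):
    """Return (url, parameters) pairs for the URLs that have a query string."""
    found = []
    for url in urls:
        query = urlsplit(url).query
        if query:
            # parse_qsl turns "Page=poney" into [("Page", "poney")]
            found.append((url, parse_qsl(query, keep_blank_values=True)))
    return found


for url, params in links_with_query(links):
    print(url, params)
```

Checking each link separately also avoids joining everything into one string first, so a match can never run across two links.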

SIGKILL [r4v463]
7 years ago

0

I am strictly against parsing HTML with regex, but do what you want. You can build your own regex for Python at this link. It’s way better to write your own regex, and from a CSS point of view it’s far less aggressive than your website.

SIGKILL [r4v463]
7 years ago

0

I finally found it. To convince you not to use regex, I will quote @dloser, aka the master of the 11th level:
[quote=dloser]And this is why I usually quote the following: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.[/quote]

Here is the thread.

Punkachu
7 years ago

0

Lol I don’t need quotation to judge a technology. Regexps are very efficient for crawling huge amounts of data in terms of execution performance, although it also depends greatly on what the data looks like.
At least I know every step and I can easily optimize.

But yeah, practice is the key to everything.

Okay, back to you, testing935.
Would you give me your complete code?
The website you use is only for running regexps.
If it prints nothing, it just means your regexp is wrong for the data you need.

Punkachu
7 years ago

0

Why don’t you use the function I gave you and then post-process its output? I don’t understand the need for your regexp.

And be careful not to overuse *: it is a strong wildcard and can swallow unexpected characters.
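A small Python 3 sketch of the warning about `*`: the sample `html` string is made up, and the three patterns are only illustrations of greedy, lazy, and bounded matching.

```python
import re

# Made-up sample with two links on one line.
html = '<a href="a.php?page=1">one</a> <a href="b.php?file=2">two</a>'

# Greedy: .* runs to the last double quote on the line, swallowing both links.
greedy = re.findall(r'href="(.*)"', html)

# Lazy: .*? stops at the first closing quote.
lazy = re.findall(r'href="(.*?)"', html)

# Bounded: [^"]* cannot cross a double quote at all.
bounded = re.findall(r'href="([^"]*)"', html)

print(greedy)
print(lazy)
print(bounded)
```

The bounded form states the intent most directly, since it names the one character the match must not cross.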

Punkachu
7 years ago | edited 7 years ago

0

Would you mind giving me your target website? I am going to do this for you.

edit:
Your regexps are not accurate at all.

SIGKILL [r4v463]
7 years ago

0

@Punkachu Do you know what DOM is?

dloser
7 years ago

0

[quote=Punkachu]Lol I don’t need quotation to judge a technology…[/quote]
But perhaps you do need some knowledge. There are some fundamental limitations to what you can do with regular expressions, regardless of practice and aside from readability issues. Regular expressions can be useful in some situations, but as soon as things get a bit complex, there are usually better options. The quote is a result of this realisation, so perhaps see it more as a suggestion to try and understand why someone would say that. ;)

Also, the performance of regular expressions is due to the fact that they can be implemented in a certain way, but that’s not just the case for regular expressions. And how important is that here anyway? I’m assuming testing935 is not running his code on Google’s datasets. :p The term “premature optimisation” comes to mind.

I feel this is kind of a case of “if all you have is a hammer…”

Now, knowing this and realising that in these non-production types of code it’s ok to be not 100% correct or accurate, you can of course use imperfect regular expressions like the one you posted. For quick hacks, I often assume that my input is of a certain format so that I can just do horrible things like x.split()[-4].split('"')[1]. But I only do it knowing that it is bad and that it doesn’t really matter in that specific case. (Although, I might later think otherwise when revisiting the code… ;))

Punkachu
7 years ago

0

You’re right, I gave him a quick answer because I don’t really get his goal.

You’re right, I answered based on my own previous project, which needed fast execution and had a huge amount of data to parse.

And yeah, I admit that for the primary goal other Python modules are perfectly suitable.

:)

Punkachu
7 years ago

0

testing935,
Do you need help? Can we go back to your problem?
Otherwise PM me; I speak French like you, I guess.

If not, then close the thread.

Mugi [Mugiwara27]
7 years ago

1

[quote=dloser]
I can just do horrible things like x.split()[-4].split('"')[1].
[/quote]

huh, that’s not horrible!

dloser
7 years ago

1

[quote=Mugiwara27]huh, that’s not horrible![/quote]

I think that makes you horrible. ;)

SIGKILL [r4v463]
7 years ago | edited 7 years ago

1

If you want true horror, I saw this on Twitter earlier today:
x="if(t%2)else";python3 -c"[print(t>>15&(t>>(2$x 4))%(3+(t>>(8$x 11))%4)+(t>>10)|42&t>>7&t<<9,end='')for t in range(2**20)]"|aplay -c2 -r4

I didn’t even try to understand what it does.

Mugi [Mugiwara27]
7 years ago

0

[quote=dloser]
I think that makes you horrible. ;)
[/quote]

ahhhh, I almost missed these kind words of dloser!
