Regex
I need help, guys!
My program must find “?page=hjklkjjmnj”,
but it can’t print the result.
Here is my code:
test = self.LFItest
adresseHTTP = self.site
socker = requests.get(adresseHTTP)
parser = URLLISTER()
parser.feed(socker.text)
TabLink = []
TabLinkValid = []
x = 0
print color.green + "\n[*] Load link of the site...\n"
print color.red + "\n\t\t LINKS \n"
for url in parser.urls :
    print "["+ str(x) + "] " + url
    TabLink.append(url.encode("utf-8"))
    x += 1
print color.blue + "\nClean intern link of site..."
string = " ".join(TabLink)
print string
re1 = '.*?'
re2 = '(\\\'.*?\\\')'
rg = re.compile(re1+re2, re.IGNORECASE|re.DOTALL )
m = rg.search(string)
if m:
    result = m.group()
    print "\n\n(" + result + ")" + "\n"
Normally it should print the variable “result”,
but nothing happens.
Thanks in advance for your answers.
Add me on hackthis
Or speak with me on IRC channel : https://www.hackthis.co.uk/irc/
Try to Ddos me, My ip : 127.0.0.1
To use requests, you must first create a session:
s = requests.Session()
Posting the error that Python returns may also be helpful.
Human stupidity, that’s why hackers always win.
- Med Amine Khelifi
@L00PeR: No, it is not required to use a session. And I suspect there is no error, because the problem is that the condition of the final if is None (the search found no match).
@testing935: Some more tips on asking questions: Do not include all this irrelevant code. As far as I can tell your question is only about the regular expression. So what is the point of the code? It only distracts from the question (as demonstrated by @L00PeR).
Also, be clear about what you want. You say you want to find “?page=hjklkjjmnj”. Do you mean exactly that string (so not “?page=hjklkjjknj”) or do you mean the query part of a URL (everything after ?) or only the first part of the query if it starts with ‘page=’ or …
Finally, try out regexes by playing around with one of those online regex tools that show you exactly how your regex works and what matches what.
Looking at your regex, I have seriously no idea how it relates to the string you want to match. Why are you matching (single) quotes? Why all the unnecessary escaping and operators?
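To illustrate the point about the quotes, here is a minimal sketch (Python 3) of what the posted pattern does. The link string is made up for the example; the assumption is that the real joined list, like this one, contains no single quotes.

```python
import re

# Same pattern as in the question: '.*?' followed by a group wrapped in single quotes.
rg = re.compile('.*?' + '(\\\'.*?\\\')', re.IGNORECASE | re.DOTALL)

# Hypothetical joined link list; note that it contains no single quotes.
links = "index.php?page=home about.html contact.php"
no_match = rg.search(links)
print(no_match)  # None, so the body of `if m:` never runs

# The pattern only matches once the text contains a single-quoted part.
quoted = rg.search("a 'b' c")
print(quoted.group(1))  # 'b' (quotes included)
```

So nothing is printed simply because search() returns None and the if body is skipped, not because of an error.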
I agree with dloser: there is a lack of information and detail in your question.
Especially when you are playing around with regular expressions: websites often keep the same overall structure in their source code (following the site’s template).
You need to take a close look at that structure to make sure you will find the right data. Moreover, keep in mind that websites remain dynamic, so you should also provide a check function to make sure the website has not changed (very important in web crawling).
But I will give you a useful tip: use the following function to get all links from any website:
import re

def get_links(html):
    """Return a list of links from html."""
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)
findall returns a list of strings, of course.
You already have the right import; if you want more help, just give more details.
Wait a minute. Don’t parse HTML with Regex
If it’s 10 times faster but doesn’t work, it’s hardly worth anything. If you write your own HTML following a clear syntax and without mistakes, then you can parse HTML with regex, but if you have a complex and non-standard structure (which HTML is by nature, because its syntax is very permissive), then your regex will fail. For example, what if there is a ' or a " in your URL? Your regex will fail because it will capture only the part of the URL before the first ' or ".
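A small sketch of that failure case, in Python 3. It compares the link regex posted earlier in the thread with the standard library's html.parser. The LinkCollector class and the sample HTML are made up for the illustration, and this is only one possible way to collect links.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Sample page: the URL contains a single quote inside a double-quoted attribute.
html = '<a href="?page=it\'s">link</a>'

# The regex stops at the first quote character it meets.
regex_links = re.findall(r'<a[^>]+href=["\'](.*?)["\']', html, re.IGNORECASE)

# The parser knows which quote opened the attribute and reads up to its match.
collector = LinkCollector()
collector.feed(html)
parser_links = collector.links

print(regex_links)   # ['?page=it']
print(parser_links)  # ["?page=it's"]
```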
Hi again,
I will explain my problem more clearly.
I want my program to parse HTML source code so that it can output the essential links, such as “?Page=poney” or “?File=yolo”. So I have a function that saves all the links (internal or external to the site, it doesn’t matter) and then keeps only the links that correspond to what I’m looking for, that is to say strings like “?Page=edkusuhf”, “?File=iushdfciosi” or “?ksuhf=skegdkeh”.
Here is the problem: I use a function that finds all the links of a page and saves them in a table; then, with the regex, I read the links to match the type of link that I seek. Except that when it needs to print the corresponding links, it does not display anything.
The online tool I use is:
http://txt2re.com/index-csharp.php3
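For the filtering step described in this post, one possible sketch in Python 3 uses the standard urllib.parse module rather than a regex. The function name links_with_params and the sample links are assumptions made for the example; the idea is simply to keep the links whose query part (everything after the ?) contains at least one name=value pair.

```python
from urllib.parse import urlsplit, parse_qsl


def links_with_params(links):
    """Keep only the links whose query string has at least one name=value pair."""
    kept = []
    for link in links:
        query = urlsplit(link).query  # text after '?', or '' if there is none
        if parse_qsl(query):          # empty list when there are no parameters
            kept.append(link)
    return kept


# Hypothetical list, standing in for the links collected from the page.
sample = ["?page=poney", "index.php?File=yolo", "about.html", "http://example.com/"]
kept = links_with_params(sample)
for link in kept:
    print(link)
```

Note that parse_qsl drops parameters with an empty value by default, so a link like “?page=” would be filtered out; whether that is wanted depends on what you are looking for.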
I am strictly against parsing HTML with regex, but do what you want. You can make your own regex for Python at this link. It’s way better to write your own regex, and from a CSS point of view it’s far less aggressive than your website.
I finally found it. To convince you not to use regex, I will quote @dloser, aka the master of the 11th level:
[quote=dloser]And this is why I usually quote the following: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems[/quote]
Here is the thread.
Lol I don’t need quotation to judge a technology. Regexps are very efficient for crawling huge amounts of data in terms of execution performance, although that also depends greatly on what the data looks like.
At least I know every step and I can easily optimize.
But yeah, practice is the key to everything.
Okay, back to you, testing935.
Would you give me your complete code?
The website you use is only for running regexps.
If it prints nothing, it just means your regexp is wrong for the data you need.
@Punkachu Do you know what DOM is?
[quote=Punkachu]Lol I don’t need quotation to judge a technology…[/quote]
But perhaps you do need some knowledge. There are some fundamental limitations to what you can do with regular expressions, regardless of practice and aside from readability issues. Regular expressions can be useful in some situations, but as soon as things get a bit complex, there are usually better options. The quote is a result of this realisation, so perhaps see it more as a suggestion to try and understand why someone would say that. ;)
Also, the performance of regular expressions is due to the fact that they can be implemented in a certain way, but that’s not just the case for regular expressions. And how important is that here anyway? I’m assuming testing935 is not running his code on Google’s datasets. :p The term “premature optimisation” comes to mind.
I feel this is kind of a case of “if all you have is a hammer…”
Now, knowing this and realising that in these non-production types of code it’s ok to be not 100% correct or accurate, you can of course use imperfect regular expressions like the one you posted. For quick hacks, I often assume that my input is of a certain format so that I can just do horrible things like x.split()[-4].split('"')[1]. But I only do it knowing that it is bad and that it doesn’t really matter in that specific case. (Although, I might later think otherwise when revisiting the code… ;))
[quote=dloser]
I can just do horrible things like x.split()[-4].split('"')[1].
[/quote]
huh, that’s not horrible!
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘\’‘ at line 1
[quote=dloser]
I think that makes you horrible. ;)
[/quote]
ahhhh, I almost missed these kind words of dloser!