Python 2.7 - Scrape a table looping in specific dates using Beautiful Soup


I have been driving myself up the wall trying to scrape the historical coffee prices table found here using BeautifulSoup: http://www.investing.com/commodities/us-coffee-c-historical-data

I am trying to pull a market week's worth of prices, from 04-04-2016 to 04-08-2016.

My ultimate goal is to scrape the entire table for those dates, pulling out the Date and Change % columns.

My first step was to create a dictionary of the dates I want, using the date format used in the table element:

dates={1 : "apr 04, 2016",   2 : "apr 05, 2016",   3 : "apr 06, 2016",   4 : "apr 07, 2016",   5 : "apr 08, 2016"} dates 

Next I want to scrape the table, but I can't work out the loop I need over the dates. Here is how I have tried to pull individual elements:

import requests
from bs4 import BeautifulSoup

url = "http://www.investing.com/commodities/us-coffee-c-historical-data"
page = requests.get(url).text
soup_coffee = BeautifulSoup(page)

coffee_table = soup_coffee.find("table", class_="genTbl closedTbl historicalTbl")
coffee_titles = coffee_table.find_all("th", class_="noWrap")

for coffee_title in coffee_titles:
    price = coffee_title.find("td", class_="greenFont")
    print(price)

Except the only value returned is:

None
None
None
None
None
None
None

Firstly, why is it returning "None"? I have a feeling it has to do with the coffee_titles part of the code, and that it is not recognizing the column titles correctly.

Secondly, is there an efficient way for me to scrape the entire table using the date range in the dates dictionary?

Any suggestions would be appreciated.

Your code fails because it is looking for td tags inside the header th tags. If you print coffee_titles, it is pretty clear why you see None:

[<th class="first left nowrap">date</th>, <th class="nowrap">price</th>, <th class="nowrap">open</th>, <th class="nowrap">high</th>, <th class="nowrap">low</th>, <th class="nowrap">vol.</th>, <th class="nowrap">change %</th>] 

There are no td tags inside them.
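For illustration, here is a small sketch of what that loop was presumably after: the header th tags only contain text, so read it directly instead of searching them for child td tags.

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.investing.com/commodities/us-coffee-c-historical-data").text
soup = BeautifulSoup(page, "lxml")
table = soup.find("table", class_="genTbl closedTbl historicalTbl")
# take the text of each header cell instead of looking for a td inside it
column_names = [th.get_text(strip=True) for th in table.find_all("th", class_="noWrap")]
# -> [u'Date', u'Price', u'Open', u'High', u'Low', u'Vol.', u'Change %']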

To get the table data, you can pull the dates from the table and use them as the keys:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

r = requests.get("http://www.investing.com/commodities/us-coffee-c-historical-data")
od = OrderedDict()
soup = BeautifulSoup(r.content, "lxml")

# select the table
table = soup.select_one("table.genTbl.closedTbl.historicalTbl")

# column names, bar the first i.e. Date
cols = [th.text for th in table.select("th")[1:]]
# all rows bar the first i.e. the headers
for row in table.select("tr + tr"):
    # all the data including the date
    data = [td.text for td in row.select("td")]
    # use the date as the key and store a dict of the remaining column values
    od[data[0]] = dict(zip(cols, data[1:]))

from pprint import pprint as pp

pp(dict(od))

Output:

{u'Jun 01, 2016': {u'Change %': u'0.29%', u'High': u'123.10', u'Low': u'120.85', u'Open': u'121.50', u'Price': u'121.90', u'Vol.': u'18.55K'},
 u'Jun 02, 2016': {u'Change %': u'0.90%', u'High': u'124.40', u'Low': u'122.15', u'Open': u'122.50', u'Price': u'123.00', u'Vol.': u'22.11K'},
 u'Jun 03, 2016': {u'Change %': u'3.33%', u'High': u'127.40', u'Low': u'122.50', u'Open': u'122.60', u'Price': u'127.10', u'Vol.': u'28.47K'},
 u'Jun 06, 2016': {u'Change %': u'3.62%', u'High': u'132.05', u'Low': u'127.10', u'Open': u'127.30', u'Price': u'131.70', u'Vol.': u'30.65K'},
 u'May 09, 2016': {u'Change %': u'2.49%', u'High': u'126.60', u'Low': u'123.28', u'Open': u'125.65', u'Price': u'126.53', u'Vol.': u'-'},
 u'May 10, 2016': {u'Change %': u'0.29%', u'High': u'125.90', u'Low': u'125.90', u'Open': u'125.90', u'Price': u'126.90', u'Vol.': u'0.01K'},
 u'May 11, 2016': {u'Change %': u'2.26%', u'High': u'129.77', u'Low': u'126.88', u'Open': u'128.60', u'Price': u'129.77', u'Vol.': u'-'},
 u'May 12, 2016': {u'Change %': u'-1.21%', u'High': u'128.75', u'Low': u'127.30', u'Open': u'128.75', u'Price': u'128.20', u'Vol.': u'0.01K'},
 u'May 13, 2016': {u'Change %': u'0.47%', u'High': u'127.85', u'Low': u'127.80', u'Open': u'127.85', u'Price': u'128.80', u'Vol.': u'0.01K'},
 u'May 16, 2016': {u'Change %': u'3.03%', u'High': u'131.95', u'Low': u'128.75', u'Open': u'128.75', u'Price': u'132.70', u'Vol.': u'0.01K'},
 u'May 17, 2016': {u'Change %': u'-0.64%', u'High': u'132.60', u'Low': u'132.60', u'Open': u'132.60', u'Price': u'131.85', u'Vol.': u'-'},
 u'May 18, 2016': {u'Change %': u'-1.93%', u'High': u'129.65', u'Low': u'128.15', u'Open': u'128.85', u'Price': u'129.30', u'Vol.': u'0.02K'},
 u'May 19, 2016': {u'Change %': u'-4.14%', u'High': u'129.00', u'Low': u'123.70', u'Open': u'128.95', u'Price': u'123.95', u'Vol.': u'29.69K'},
 u'May 20, 2016': {u'Change %': u'0.61%', u'High': u'125.95', u'Low': u'124.25', u'Open': u'124.75', u'Price': u'124.70', u'Vol.': u'15.54K'},
 u'May 23, 2016': {u'Change %': u'-2.04%', u'High': u'124.70', u'Low': u'122.00', u'Open': u'124.50', u'Price': u'122.15', u'Vol.': u'15.89K'},
 u'May 24, 2016': {u'Change %': u'-0.29%', u'High': u'123.30', u'Low': u'121.55', u'Open': u'122.45', u'Price': u'121.80', u'Vol.': u'15.06K'},
 u'May 25, 2016': {u'Change %': u'-0.33%', u'High': u'122.95', u'Low': u'121.20', u'Open': u'122.45', u'Price': u'121.40', u'Vol.': u'18.11K'},
 u'May 26, 2016': {u'Change %': u'0.08%', u'High': u'122.15', u'Low': u'121.20', u'Open': u'121.90', u'Price': u'121.50', u'Vol.': u'19.27K'},
 u'May 27, 2016': {u'Change %': u'-0.16%', u'High': u'122.35', u'Low': u'120.80', u'Open': u'122.10', u'Price': u'121.30', u'Vol.': u'13.52K'},
 u'May 31, 2016': {u'Change %': u'0.21%', u'High': u'123.90', u'Low': u'121.35', u'Open': u'121.55', u'Price': u'121.55', u'Vol.': u'23.62K'}}
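If you only care about the Change % column for the dates in your dictionary, filtering the result is a plain lookup. A sketch reusing the od built above, with a guard that simply skips any date not present in the scraped table:

dates = {1: "Apr 04, 2016", 2: "Apr 05, 2016", 3: "Apr 06, 2016",
         4: "Apr 07, 2016", 5: "Apr 08, 2016"}
# keep only the rows whose date appears in the dictionary
changes = {d: od[d]["Change %"] for d in dates.values() if d in od}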

Now for specific dates, you need to mimic the ajax call, making a post request to http://www.investing.com/instruments/HistoricalDataAjax:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

# data to post
data = {"action": "historical_data",
        "curr_id": "8832",
        "st_date": "04/04/2016",
        "end_date": "04/08/2016",
        "interval_sec": "Daily"}

# add a user agent and specify that we are making an ajax request
head = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"}

with requests.Session() as s:
    r = s.post("http://www.investing.com/instruments/HistoricalDataAjax", data=data, headers=head)
    od = OrderedDict()
    soup = BeautifulSoup(r.content, "lxml")

    table = soup.select_one("table.genTbl.closedTbl.historicalTbl")

    cols = [th.text for th in table.select("th")][1:]
    for row in table.select("tr + tr"):
        data = [td.text for td in row.select("td")]
        od[data[0]] = dict(zip(cols, data[1:]))

from pprint import pprint as pp

pp(dict(od))

Now you get only the date range from st_date to end_date:

{u'Apr 04, 2016': {u'Change %': u'-3.50%', u'High': u'126.55', u'Low': u'122.30', u'Open': u'125.80', u'Price': u'122.80', u'Vol.': u'25.18K'},
 u'Apr 05, 2016': {u'Change %': u'-1.55%', u'High': u'122.85', u'Low': u'120.55', u'Open': u'122.85', u'Price': u'120.90', u'Vol.': u'25.77K'},
 u'Apr 06, 2016': {u'Change %': u'0.50%', u'High': u'122.15', u'Low': u'120.00', u'Open': u'121.45', u'Price': u'121.50', u'Vol.': u'17.94K'},
 u'Apr 07, 2016': {u'Change %': u'-1.40%', u'High': u'122.60', u'Low': u'119.60', u'Open': u'122.35', u'Price': u'119.80', u'Vol.': u'32.69K'}}
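If the end goal is to keep only the Date and Change % columns, you could dump the result straight to a CSV file. A minimal sketch reusing od from the post example above; the filename coffee_changes.csv is just an example:

import csv

with open("coffee_changes.csv", "wb") as f:  # "wb" for the Python 2.7 csv module
    writer = csv.DictWriter(f, fieldnames=["Date", "Change %"])
    writer.writeheader()
    for date, row in od.items():
        writer.writerow({"Date": date, "Change %": row["Change %"]})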

You can see the post request in Chrome developer tools under the XHR tab.
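If you eventually want the entire table, the same post can be repeated per date range and the results merged. A sketch, assuming a hypothetical helper scrape_range and a list of (start, end) strings in the MM/DD/YYYY format the endpoint expects:

import requests
from bs4 import BeautifulSoup
from collections import OrderedDict

def scrape_range(session, st_date, end_date):
    # scrape one date range via the same ajax endpoint as above
    data = {"action": "historical_data", "curr_id": "8832",
            "st_date": st_date, "end_date": end_date, "interval_sec": "Daily"}
    head = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"}
    r = session.post("http://www.investing.com/instruments/HistoricalDataAjax",
                     data=data, headers=head)
    soup = BeautifulSoup(r.content, "lxml")
    table = soup.select_one("table.genTbl.closedTbl.historicalTbl")
    cols = [th.text for th in table.select("th")][1:]
    od = OrderedDict()
    for row in table.select("tr + tr"):
        cells = [td.text for td in row.select("td")]
        od[cells[0]] = dict(zip(cols, cells[1:]))
    return od

ranges = [("04/04/2016", "04/08/2016"), ("04/11/2016", "04/15/2016")]
all_data = OrderedDict()
with requests.Session() as s:
    for st, end in ranges:
        all_data.update(scrape_range(s, st, end))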
