Parsing an HTML table using Python: HTMLParser or lxml
I have an HTML page that consists of a table, and I want to retrieve all the values in the td and tr elements of that table.
I have tried working with BeautifulSoup, but now I would like to use lxml or the HTML parser that ships with Python.
I have attached the example below.
I want to retrieve the values as a list of lists of tuples, like
[
[(value of 2050 jan, value of main subject-part1-sub part1-subject1), (value of 2050 feb, value of main subject-part1-sub part1-subject1),... ],
[(value of 2050 jan, value of main subject-part1-sub part1-subject2), (value of 2050 feb, value of main subject-part1-sub part1-subject2)... ]
]
and so on.
Can anyone tell me how to process this in the most "optimal" way using lxml or Python's HTML parser?
Example: test.html
<HTML>
<HEAD>
<TITLE>Title</TITLE>
</HEAD>
<BODY>
<TABLE BORDER>
<TR ALIGN=LEFT>
<TH COLSPAN=38>Main Subject</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=18>part1</TH>
<TH VALIGN=TOP COLSPAN=18>part2</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=9>sub-part1</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part2</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part3</TH>
<TH VALIGN=TOP COLSPAN=9>sub-part4</TH>
</TR>
<TR ALIGN=LEFT>
<TH COLSPAN=2> </TH>
<TH VALIGN=TOP COLSPAN=1>subject1</TH>
<TH VALIGN=TOP COLSPAN=1>subject2</TH>
<TH VALIGN=TOP COLSPAN=1>subject10</TH>
<TH VALIGN=TOP COLSPAN=1>subject11</TH>
<TH VALIGN=TOP COLSPAN=1>subject12</TH>
<TH VALIGN=TOP COLSPAN=1>subject13</TH>
<TH VALIGN=TOP COLSPAN=1>subject14</TH>
<TH VALIGN=TOP COLSPAN=1>subject15</TH>
<TH VALIGN=TOP COLSPAN=1>subject16</TH>
<TH VALIGN=TOP COLSPAN=1>subject17</TH>
<TH VALIGN=TOP COLSPAN=1>subject18</TH>
<TH VALIGN=TOP COLSPAN=1>subject19</TH>
<TH VALIGN=TOP COLSPAN=1>subject20</TH>
<TH VALIGN=TOP COLSPAN=1>subject21</TH>
<TH VALIGN=TOP COLSPAN=1>subject22</TH>
<TH VALIGN=TOP COLSPAN=1>subject23</TH>
<TH VALIGN=TOP COLSPAN=1>subject24</TH>
<TH VALIGN=TOP COLSPAN=1>subject25</TH>
<TH VALIGN=TOP COLSPAN=1>subject26</TH>
<TH VALIGN=TOP COLSPAN=1>subject27</TH>
<TH VALIGN=TOP COLSPAN=1>subject28</TH>
<TH VALIGN=TOP COLSPAN=1>subject29</TH>
<TH VALIGN=TOP COLSPAN=1>subject30</TH>
<TH VALIGN=TOP COLSPAN=1>subject31</TH>
<TH VALIGN=TOP COLSPAN=1>subject32</TH>
<TH VALIGN=TOP COLSPAN=1>subject33</TH>
<TH VALIGN=TOP COLSPAN=1>subject34</TH>
<TH VALIGN=TOP COLSPAN=1>subject35</TH>
<TH VALIGN=TOP COLSPAN=1>subject36</TH>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT VALIGN=TOP ROWSPAN=12>2050</TH>
<TH ALIGN=LEFT>January</TH>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>4</TD>
<TD>16</TD>
<TD>0</TD>
<TD>6</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>
<TD>0</TD>
<TD>26</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>February</TH>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>2</TD>
<TD>4</TD>
<TD>1</TD>
<TD>6</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>25</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>4</TD>
<TD>14</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>March</TH>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>7</TD>
<TD>0</TD>
<TD>9</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>9</TD>
<TD>0</TD>
<TD>45</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>10</TD>
<TD>16</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>April</TH>
<TD>1</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>3</TD>
<TD>12</TD>
<TD>1</TD>
<TD>11</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>2</TD>
<TD>34</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>6</TD>
<TD>18</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>May</TH>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>4</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>7</TD>
<TD>1</TD>
<TD>30</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>5</TD>
<TD>12</TD>
<TD>0</TD>
<TD>4</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>6</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>June</TH>
<TD>0</TD>
<TD>1</TD>
<TD>14</TD>
<TD>0</TD>
<TD>7</TD>
<TD>15</TD>
<TD>0</TD>
<TD>17</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>1</TD>
<TD>3</TD>
<TD>0</TD>
<TD>24</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>6</TD>
<TD>13</TD>
<TD>1</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>1</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>July</TH>
<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>8</TD>
<TD>17</TD>
<TD>1</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>2</TD>
<TD>15</TD>
<TD>2</TD>
<TD>53</TD>
<TD>0</TD>
<TD>3</TD>
<TD>3</TD>
<TD>6</TD>
<TD>0</TD>
<TD>7</TD>
<TD>16</TD>
<TD>0</TD>
<TD>9</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>August</TH>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>0</TD>
<TD>8</TD>
<TD>15</TD>
<TD>1</TD>
<TD>17</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>5</TD>
<TD>16</TD>
<TD>0</TD>
<TD>33</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>11</TD>
<TD>0</TD>
<TD>2</TD>
<TD>25</TD>
<TD>4</TD>
<TD>8</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>September</TH>
<TD>2</TD>
<TD>0</TD>
<TD>10</TD>
<TD>0</TD>
<TD>16</TD>
<TD>22</TD>
<TD>2</TD>
<TD>19</TD>
<TD>4</TD>
<TD>2</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>8</TD>
<TD>0</TD>
<TD>27</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>11</TD>
<TD>31</TD>
<TD>1</TD>
<TD>9</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>October</TH>
<TD>3</TD>
<TD>1</TD>
<TD>8</TD>
<TD>0</TD>
<TD>4</TD>
<TD>28</TD>
<TD>0</TD>
<TD>15</TD>
<TD>2</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>1</TD>
<TD>6</TD>
<TD>0</TD>
<TD>15</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
<TD>9</TD>
<TD>26</TD>
<TD>1</TD>
<TD>8</TD>
<TD>4</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>November</TH>
<TD>0</TD>
<TD>3</TD>
<TD>3</TD>
<TD>0</TD>
<TD>6</TD>
<TD>23</TD>
<TD>1</TD>
<TD>8</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>3</TD>
<TD>7</TD>
<TD>1</TD>
<TD>20</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>8</TD>
<TD>0</TD>
<TD>3</TD>
<TD>18</TD>
<TD>3</TD>
<TD>7</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
<TR ALIGN=RIGHT>
<TH ALIGN=LEFT>December</TH>
<TD>1</TD>
<TD>0</TD>
<TD>4</TD>
<TD>0</TD>
<TD>4</TD>
<TD>13</TD>
<TD>2</TD>
<TD>15</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>2</TD>
<TD>0</TD>
<TD>1</TD>
<TD>2</TD>
<TD>0</TD>
<TD>29</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>7</TD>
<TD>0</TD>
<TD>3</TD>
<TD>20</TD>
<TD>1</TD>
<TD>13</TD>
<TD>0</TD>
<TD>1</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>0</TD>
<TD>3</TD>
<TD>0</TD>
</TR>
</TABLE>
</BODY>
</HTML>
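One possible approach with only the standard library is sketched below. It is not a definitive solution: the class name `TableRows` and the function `columns_from_html` are made up for illustration. It assumes that every cell is explicitly closed (as in test.html), that data rows are the ones containing TD cells, and that a data row with two TH cells starts a new ROWSPAN block whose first TH is the year. It also reads the desired output as one list per data column, holding ((year, month), value) tuples.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect every <tr> as a list of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # [tag, text parts] of the cell being read

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the parts.
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            cell_tag, parts = self._cell
            self._row.append((cell_tag, "".join(parts).strip()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def columns_from_html(html):
    """Return one list per data column of ((year, month), value) tuples."""
    parser = TableRows()
    parser.feed(html)
    parser.close()

    labels, value_rows = [], []
    year = None
    for row in parser.rows:
        values = [text for tag, text in row if tag == "td"]
        if not values:
            continue  # header-only row, no data cells
        heads = [text for tag, text in row if tag == "th"]
        if len(heads) == 2:
            year = heads[0]  # assumed: first TH of a ROWSPAN block is the year
        labels.append((year, heads[-1] if heads else None))
        value_rows.append(values)

    # zip(*rows) transposes rows into columns (cut to the shortest row).
    return [list(zip(labels, column)) for column in zip(*value_rows)]
```

Calling `columns_from_html(open("test.html").read())` should then give one inner list per TD column. The values stay strings; wrap them in `int()` if numbers are needed. The header rows with COLSPAN are skipped here, so mapping each column back to its part, sub-part, and subject name would need extra code.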
+1 for a blessed question that tries to use the proper tool for HTML parsing – bernie
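For the lxml route, a minimal sketch could look like the following. The function name `columns_with_lxml` is hypothetical. It assumes there is a single table in the page (nested tables would also be picked up by `iter("tr")`), that data rows are the ones with TD cells, and that a data row with two TH cells carries the year in its first TH, which then applies to the following rows covered by the ROWSPAN.

```python
import lxml.html


def columns_with_lxml(html):
    """Return one list per data column of ((year, month), value) tuples."""
    root = lxml.html.fromstring(html)

    labels, value_rows = [], []
    year = None
    # lxml's HTML parser lowercases tag names, so <TR> is matched by "tr".
    for tr in root.iter("tr"):
        values = [td.text_content().strip() for td in tr.findall("td")]
        if not values:
            continue  # header-only row, no data cells
        heads = [th.text_content().strip() for th in tr.findall("th")]
        if len(heads) == 2:
            year = heads[0]  # assumed: first TH of a ROWSPAN block is the year
        labels.append((year, heads[-1] if heads else None))
        value_rows.append(values)

    # zip(*rows) transposes rows into columns (cut to the shortest row).
    return [list(zip(labels, column)) for column in zip(*value_rows)]
```

For a file on disk, `lxml.html.parse("test.html").getroot()` could replace the `fromstring` call. Whether this is "optimal" depends on the size of the page; for a single table like the one above, parsing the whole tree once and walking the rows is usually good enough.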