How to extract file names from a field containing HTML content in SQL Server?

We have a CMS that writes blocks of HTML content into a SQL Server database. I know the name of the table and the name of the field where these HTML content blocks reside. Some of the HTML contains links to PDF files. Here is a fragment:

<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf" 
target="_blank">list</a>.</p>

I need to extract the PDF file names from all of these HTML content blocks. In the end I need to get a list:

Tuition-Reimbursement-Deferred.pdf 
Some-other-file.pdf

of all the PDF file names from that field.

Any help is appreciated. Thanks.

UPDATE

I have received many answers, thank you very much, but I forgot to mention that we are still using SQL Server 2000 here. So this has to be done in SQL that runs on SQL Server 2000.


2013-04-25 monstro

Will they all have a folder preceding the file name?

Do you have to do this in T-SQL? It is a very poor language for text parsing, and this would be much simpler in a different language that has an HTML parsing library. – Pondlife

Agreed, this is just a quick option. If necessary, I will use HTMLAgilityPack in my C# code to parse the content. – monstro

Well, it's not pretty, but this works using standard Transact-SQL:

SELECT CASE WHEN CHARINDEX('.pdf', html) > 0 
      THEN SUBSTRING(
        html, 
        CHARINDEX('.pdf', html) - 
        PATINDEX(
         '%["/]%', 
         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1, 
        PATINDEX(
         '%["/]%', 
         REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3) 
      ELSE NULL 
     END AS filename 
FROM mytable

You could expand the list of characters that delimit the start of the file name beyond ["/] (which matches either a double quote or a forward slash) if you like.
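To illustrate widening the delimiter set, here is a sketch of the same query with the bracket class extended to also match a single quote and an equals sign, so that single-quoted and unquoted href values are handled. The table variable @t and its rows are hypothetical sample data, and the statements are written with SQL Server 2000 in mind (varchar(8000), INSERT ... SELECT ... UNION ALL).

```sql
-- Hypothetical sample data; @t stands in for the real table and html for the real column.
DECLARE @t TABLE (html varchar(8000));
INSERT @t
SELECT '<a href=''/look/path/The second file.pdf'' target="_blank">list</a>'
UNION ALL
SELECT '<a href=report.pdf>list</a>'
UNION ALL
SELECT '<p>No link here</p>';

-- Same expression as above, with the delimiter class widened to ["'/=].
-- A single quote inside a T-SQL string literal is written as two single quotes.
SELECT CASE WHEN CHARINDEX('.pdf', html) > 0
         THEN SUBSTRING(
                html,
                CHARINDEX('.pdf', html) -
                PATINDEX(
                  '%["''/=]%',
                  REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 1,
                PATINDEX(
                  '%["''/=]%',
                  REVERSE(SUBSTRING(html, 0, CHARINDEX('.pdf', html)))) + 3)
         ELSE NULL
       END AS filename
FROM @t;
```

Like the original query, this returns only the first .pdf occurrence in each row, and it assumes that one of the delimiter characters appears somewhere before that occurrence.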

See the SQL Fiddle demo


2013-04-25 21:43:46

Great. I forgot to mention that we have SQL 2000 here, so this approach works!! Many thanks. – monstro

Create this function:

create function dbo.extract_filenames_from_a_tags (@s nvarchar(max)) 
returns @res table (pdf nvarchar(max)) as 
begin 
-- assumes there are no single quotes or double quotes in the PDF filename 
declare @i int, @j int, @k int, @tmp nvarchar(max); 
set @i = charindex(N'.pdf', @s); 
while @i > 0 
begin 
    select @tmp = left(@s, @i+3); 
    select @j = charindex('/', reverse(@tmp)); -- directory delimiter 
    select @k = charindex('"', reverse(@tmp)); -- start of href 
    if @j = 0 or (@k > 0 and @k < @j) set @j = @k; 
    select @k = charindex('''', reverse(@tmp)); -- start of href (single-quote*) 
    if @j = 0 or (@k > 0 and @k < @j) set @j = @k; 
    insert @res values (substring(@tmp, len(@tmp)-@j+2, len(@tmp))); 
    select @s = stuff(@s, 1, @i+4, ''); -- remove up to ".pdf" 
    set @i = charindex(N'.pdf', @s); 
end 
return 
end 
GO

A demo of using that function:

declare @t table (html varchar(max)); 
insert @t values 
    (' 
<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf" 
target="_blank">list</a>.</p>'), 
    (' 
<p>A deferred tuition payment plan, 
or view the <a href="Two files here-Reimbursement-Deferred.pdf" 
target="_blank">list</a>.</p>And I use single quotes 
    <a href=''/look/path/The second file.pdf'' 
target="_blank">list</a>'); 

select t.*, p.pdf 
from @t t 
cross apply dbo.extract_filenames_from_a_tags(html) p;

Results:

|HTML                  |PDF                                       |
-------------------------------------------------------------------
|<p>A deferred tui.... |Tuition-Reimbursement-Deferred.pdf        |
|<p>A deferred tui.... |Two files here-Reimbursement-Deferred.pdf |
|<p>A deferred tui.... |The second file.pdf                       |

SQL Fiddle Demo


2013-04-25 21:25:40 RichardTheKiwi

This is a fantastic function.

Thank you very much, it works perfectly, but I forgot to mention that we are still using SQL Server 2000, and this code will not work on SQL 2000. – monstro
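The function above relies on nvarchar(max) and CROSS APPLY, both of which arrived with SQL Server 2005. As a rough sketch only, the same scanning loop can be written as a plain batch that should run on SQL Server 2000, assuming the HTML fits in varchar(8000). The table variables @t and @res, the id column, and the sample rows are hypothetical stand-ins for the real table.

```sql
-- Hypothetical sample data; replace @t with the real table and key column.
DECLARE @t TABLE (id int IDENTITY(1,1), html varchar(8000));
INSERT @t (html)
SELECT '<p>View the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf" target="_blank">list</a>.</p>'
UNION ALL
SELECT '<a href="Two files here-Reimbursement-Deferred.pdf">list</a> and <a href=''/look/path/The second file.pdf''>list</a>';

DECLARE @res TABLE (id int, pdf varchar(255));
DECLARE @id int, @s varchar(8000), @head varchar(8000), @i int, @d int;

SELECT @id = MIN(id) FROM @t;
WHILE @id IS NOT NULL                       -- walk the rows one at a time
BEGIN
    SELECT @s = html FROM @t WHERE id = @id;
    SET @i = CHARINDEX('.pdf', @s);
    WHILE @i > 0                            -- one pass per ".pdf" occurrence
    BEGIN
        SET @head = SUBSTRING(@s, 1, @i - 1);           -- text before ".pdf"
        -- distance from the end of @head back to the nearest ", ' or /
        SET @d = PATINDEX('%["''/]%', REVERSE(@head));
        IF @d = 0 SET @d = @i;              -- no delimiter: take everything before ".pdf"
        INSERT @res (id, pdf)
        VALUES (@id, SUBSTRING(@s, @i - @d + 1, @d + 3));
        SET @s = STUFF(@s, 1, @i + 3, '');  -- drop everything up to and including ".pdf"
        SET @i = CHARINDEX('.pdf', @s);
    END
    SELECT @id = MIN(id) FROM @t WHERE id > @id;
END

SELECT id, pdf FROM @res ORDER BY id;
```

The row-by-row loop stands in for CROSS APPLY; on a large table a cursor or a client-side HTML parser may be a better fit. Content stored in a text or ntext column would need to be read in chunks, which this sketch does not attempt.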

How about treating that HTML as XML?

declare @t table (html varchar(max)); 
insert @t 
    select ' 
    <p>A deferred tuition payment plan, 
    or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf" 
    target="_blank">list</a>.</p>' 
    union all 
    select ' 
    <p>A deferred tuition payment plan, 
    or view the <a href="Two files here-Reimbursement-Deferred.pdf" 
    target="_blank">list</a>.</p>And I use single quotes 
     <a href=''/look/path/The second file.pdf'' 
    target="_blank">list</a>' 

select [filename] = reverse(left(reverse('/'+p.n.value('@href', 'varchar(100)')), charindex('/',reverse('/'+p.n.value('@href', 'varchar(100)')), 1) - 1)) 
from ( select cast(html as xml) 
      from @t 
     ) x(doc) 
cross 
apply doc.nodes('//a') p(n);

Results:

filename 
--------------------------------------------------------------- 
Tuition-Reimbursement-Deferred.pdf 
Two files here-Reimbursement-Deferred.pdf 
The second file.pdf
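One caveat with this approach: CAST(html AS xml) only succeeds when the stored HTML happens to be well-formed XML, so an unclosed tag such as a bare <br> raises a conversion error for the whole query. A minimal sketch, assuming SQL Server 2012 or later (where TRY_CAST is available) and hypothetical sample rows, for listing the rows that would fail before running the extraction:

```sql
-- Hypothetical sample data: row 1 is well-formed, row 2 has an unclosed <br>.
DECLARE @t TABLE (id int, html varchar(max));
INSERT @t VALUES
    (1, '<p>See the <a href="/docs/a.pdf">list</a>.</p>'),
    (2, '<p>First line<br>second line <a href="/docs/b.pdf">list</a></p>');

-- TRY_CAST yields NULL instead of an error when the conversion fails,
-- so this lists the rows that cannot be parsed as XML.
SELECT id, html
FROM @t
WHERE html IS NOT NULL
  AND TRY_CAST(html AS xml) IS NULL;
```

Rows reported here would need a string-based approach, or cleanup of the markup, before the XML query can be applied to them.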


2013-04-25 22:06:27

Try this:

DECLARE @XML XML = 
'<p>A deferred tuition payment plan, 
or view the <a href="/uploadedFiles/Tuition-Reimbursement-Deferred.pdf" 
target="_blank">list</a>.</p>' 

SELECT 
     ref_text = t.p.value('./a[1]', 'NVARCHAR(50)') 
    , ref_filename = REVERSE(
         LEFT(REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 
         CHARINDEX('/',REVERSE(t.p.value('./a[1]/@href', 'NVARCHAR(50)')), 1) - 1)) 
FROM @XML.nodes('/p') t(p)


2013-04-26 05:25:23 Devart

Thank you very much, but I forgot to mention that we are still using SQL Server 2000 here, and it does not have the XML data type :( – monstro
