Extracting Tables from PDFs

March 20, 2025    Development F# AI

I needed to extract error code tables from a large, 266-page PDF of system documentation so I could insert them into a database for reference.

My first attempt was to use Microsoft Copilot, but it couldn’t handle the document: “We’re sorry. Copilot is temporarily unable to summarize documents of this size directly from your tab. However, you may be able to get summaries or ask questions about documents of this size by uploading them to Copilot using the file attachment icon in the input box.” Later, after I published this post, I came back and uploaded the file, then asked it “can you extract the error codes from the table as sql inserts”. It did create a lot of rows for me to use, but it stopped before it got all of them. Maybe with more prompts I would have gotten the rest. This was a giant 700 page document.
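To make the goal concrete, here is a minimal Python sketch of what "error codes as SQL inserts" could look like once the rows are out of the PDF. The table name `error_codes`, the column names `code` and `description`, and the sample rows are all hypothetical placeholders, not taken from the real document or database.

```python
def to_insert_statements(rows, table="error_codes"):
    """Turn (code, description) pairs into SQL INSERT statements.

    The table and column names are hypothetical placeholders.
    """
    statements = []
    for code, description in rows:
        # Double any single quotes so the text stays a valid SQL string literal.
        safe_code = str(code).replace("'", "''")
        safe_desc = str(description).replace("'", "''")
        statements.append(
            f"INSERT INTO {table} (code, description) "
            f"VALUES ('{safe_code}', '{safe_desc}');"
        )
    return statements


if __name__ == "__main__":
    # Made-up sample rows, not taken from the real document.
    sample = [("E100", "Device not found"), ("E101", "Can't open port")]
    for statement in to_insert_statements(sample):
        print(statement)
```

Generating plain statements keeps the output easy to review before running it. When loading directly from a program, parameterized queries would be the safer choice.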

I could have just copied and pasted the error codes out of it, but since I had extra time I decided to experiment with the Windsurf IDE free trial and AI to generate F# code. It did a pretty good job and saved me a lot of time and trial and error. I didn’t want to spend time hand-writing code that would be thrown away.

Windsurf IDE

I am impressed by the ability to interact with the code, run commands, and automatically fix linting issues and errors, as well as by its understanding of containers and how quickly it works. I’ve been using VS Code and GitHub Copilot (albeit not a whole lot), but this agentic approach feels like a step above that. You still have to think about what you need to do and verify the code, but for prototyping this is a game changer. I couldn’t have imagined having a tool like this even 3 years ago.

I used 11/40 User Prompt Premium credits and 31/200 Flow credits through this process on my free trial.

Something to be aware of: even if you got Windsurf licenses and wanted to do C# development, you’d still need a VS license to use C# Dev Kit, or else go without debugging and the nicer tools (see https://github.com/VSCodium/vscodium/blob/master/docs/index.md#visual-studio-marketplace ).

I started with the prompt “Create a F# program to extract tables from a PDF document” and went through a few iterations with the Windsurf IDE AI.

Attempts

iText7

Its first suggestion was to use iText7, but that failed to find the tables.

I didn’t dig deeper into iText7, but the AI tried hard with a custom `TableExtractionStrategy` that tried to use lines and EndPoints. That seems hacky, though I could see myself trying it if I had kept at it.
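For readers curious what a line-based approach involves, here is a rough Python sketch of the general idea: treat the ruling lines as cell boundaries and drop each text fragment into the cell that contains it. This is not the `TableExtractionStrategy` the AI generated and it does not use the iText7 API. The function name, the input shapes, and the coordinates are all assumptions for illustration. It assumes PDF-style coordinates (y grows upward) and fragments supplied in reading order.

```python
from bisect import bisect_right


def build_grid(h_lines_y, v_lines_x, fragments):
    """Place text fragments into table cells bounded by ruling lines.

    h_lines_y: y coordinates of the horizontal lines
    v_lines_x: x coordinates of the vertical lines
    fragments: (x, y, text) tuples in reading order
    PDF-style coordinates are assumed, so y grows upward.
    """
    ys = sorted(set(h_lines_y))
    xs = sorted(set(v_lines_x))
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]

    for x, y, text in fragments:
        col = bisect_right(xs, x) - 1
        band = bisect_right(ys, y) - 1  # band 0 is the bottom row
        if 0 <= col < n_cols and 0 <= band < n_rows:
            row = n_rows - 1 - band  # flip so row 0 is the top row
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


if __name__ == "__main__":
    # Made-up coordinates for a 2x2 table.
    cells = build_grid(
        h_lines_y=[100, 80, 60],
        v_lines_x=[0, 50, 200],
        fragments=[(10, 90, "Code"), (60, 90, "Description"),
                   (10, 70, "E100"), (60, 70, "Device"), (120, 70, "not found")],
    )
    for row in cells:
        print("\t".join(row))
```

Real PDFs complicate this quickly: lines may be drawn as thin rectangles, cells may be merged, and tables may have no ruling lines at all, which may be part of why this route feels hacky.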

FSharp.Data

I asked it to use FSharp.Data. For that, it found pdf2htmlEX and ran it in a container. The HTML was good and looked like the PDF, but there were no tables to key off, only divs.

I ended up running the pdf2htmlEX command in the terminal because it didn’t complete when run from this code. I didn’t take the time to figure out what the AI had wrong.

podman pull docker.io/bwits/pdf2htmlex

podman run -ti --rm -v C:\pdfs:/pdf docker.io/bwits/pdf2htmlex pdf2htmlEX UserGuide.pdf

This will create the UserGuide.html file.
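If you would rather script that step than type it, the same command can be launched from a small program. Below is a minimal Python sketch, assuming Podman is on the PATH. It reuses the image and arguments from the command above but leaves out `-ti`, on the assumption that there is no interactive terminal when the command is started from code. Whether that flag was what kept the generated F# version from completing is not something I verified. The function names and the timeout value are made up for this example.

```python
import subprocess
from pathlib import Path

IMAGE = "docker.io/bwits/pdf2htmlex"


def build_command(pdf_path):
    """Build the podman command line for converting one PDF."""
    pdf = Path(pdf_path)
    return [
        "podman", "run", "--rm",
        "-v", f"{pdf.parent}:/pdf",  # mount the PDF's folder as /pdf
        IMAGE, "pdf2htmlEX", pdf.name,
    ]


def convert(pdf_path, timeout_seconds=600):
    """Run the conversion; return the HTML path, or None on failure."""
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,  # collects stdout and stderr together
        text=True,
        timeout=timeout_seconds,  # raises TimeoutExpired if it hangs
    )
    if result.returncode != 0:
        print(result.stderr)
        return None
    html_path = Path(pdf_path).with_suffix(".html")
    return html_path if html_path.exists() else None
```

Calling `convert` with the full path to UserGuide.pdf should return the path of the generated HTML file next to the PDF, or `None` if the container reports an error.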

namespace PdfTableExtractor

module PodmanExtractor =
    open FSharp.Data
    open System.IO
    open System.Diagnostics

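    // Function to convert PDF to HTML using pdf2htmlEX via Podman.
    // The Podman call is commented out because it did not complete;
    // run the pdf2htmlEX command manually first so the HTML file already exists.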
    let convertPdfToHtml (pdfPath: string) =
        let pdfDir = Path.GetDirectoryName(pdfPath)
        let pdfName = Path.GetFileName(pdfPath)
        let htmlName = Path.ChangeExtension(pdfName, ".html")
        
        printfn "PDF Directory: %s" pdfDir
        printfn "PDF Name: %s" pdfName
        printfn "HTML Name: %s" htmlName
        
        // podman run -ti --rm -v C:\avera:/pdf docker.io/bwits/pdf2htmlex pdf2htmlEX UserGuide.pdf
        // use proc = new Process()
        // proc.StartInfo.FileName <- "podman"
        // proc.StartInfo.Arguments <- sprintf "run -ti --rm -v \"%s\":/pdf docker.io/bwits/pdf2htmlex pdf2htmlEX --zoom 1.3 %s %s" pdfDir pdfName htmlName
        
        // proc.StartInfo.RedirectStandardError <- true
        // proc.StartInfo.RedirectStandardOutput <- true
        // proc.StartInfo.UseShellExecute <- false
        
        // // Start the process
        // printfn "Starting process with command: %s %s" proc.StartInfo.FileName proc.StartInfo.Arguments
        // proc.Start() |> ignore
        
        // // Read and display output
        // let output = proc.StandardOutput.ReadToEnd()
        // let error = proc.StandardError.ReadToEnd()
        
        // printfn "Process Output:\n%s" output
        // printfn "Process Error:\n%s" error
        
        // proc.WaitForExit()
        
        let outputPath = Path.Combine(pdfDir, htmlName)
        Some outputPath
        // printfn "Expected output path: %s" outputPath
        
        // if proc.ExitCode = 0 && File.Exists(outputPath) then 
        //     printfn "Conversion successful"
        //     Some outputPath 
        // else 
        //     printfn "Conversion failed with exit code: %d" proc.ExitCode
        //     None

    // Function to extract tables from HTML
    let extractTablesFromHtml (htmlPath: string) =
        let html = HtmlDocument.Load(htmlPath)
        html.Descendants ["table"]
        |> Seq.mapi (fun i table ->
            let rows = 
                table.Descendants ["tr"]
                |> Seq.map (fun row ->
                    row.Descendants ["td"; "th"]
                    |> Seq.map (fun cell -> cell.InnerText().Trim())
                    |> String.concat "\t")
                |> String.concat "\n"
            sprintf "Table %d:\n%s\n" (i + 1) rows)
        |> String.concat "\n"

    // Main function to process a PDF file
    let processPdf pdfPath =
        if File.Exists(pdfPath) then
            try
                printfn "Converting PDF to HTML using pdf2htmlEX (via Podman)..."
                match convertPdfToHtml pdfPath with
                | Some htmlPath ->
                    let tables = extractTablesFromHtml htmlPath
                    // Clean up the temporary HTML file
                    File.Delete htmlPath
                    Ok tables
                | None ->
                    Error "Error converting PDF to HTML. Make sure Podman is installed and running."
            with
            | ex -> Error $"Error processing PDF: {ex.Message}"
        else
            Error "File does not exist"

module Program =
    [<EntryPoint>]
    let main argv =
        match argv with
        | [|pdfPath|] ->
            match PodmanExtractor.processPdf pdfPath with
            | Ok tables ->
                printfn "Extracted tables:\n%s" tables
                0
            | Error msg ->
                eprintfn "%s" msg
                1
        | _ ->
            eprintfn "Usage: PdfTableExtractor <path-to-pdf>"
            1
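Because the pdf2htmlEX output had only divs, the `html.Descendants ["table"]` lookup in the listing above has nothing to find. A minimal Python sketch of how to confirm that and see what text the divs hold is below. It uses only the standard library. The class name and the sample HTML are made up for illustration and are much simpler than real pdf2htmlEX output.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Count <table> tags and collect the text found directly in each <div>."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.lines = []
        self._stack = []  # one text buffer per open <div>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag == "div":
            self._stack.append([])

    def handle_data(self, data):
        # Text goes to the innermost open <div>, including text inside spans.
        if self._stack and data.strip():
            self._stack[-1].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            text = " ".join(self._stack.pop())
            if text:
                self.lines.append(text)


if __name__ == "__main__":
    # Made-up sample that only mimics the div-only shape of the output.
    sample = ('<div class="page"><div>E100</div>'
              '<div>Device <span>not found</span></div></div>')
    parser = DivTextCollector()
    parser.feed(sample)
    print("tables:", parser.table_count)
    print(parser.lines)
```

This only gets the text out. Turning those lines back into rows and columns would still need some grouping rule, since the divs themselves carry no table structure.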

Camelot Python

I asked it to use Camelot (Python), but I didn’t go as far as running it in a container (I didn’t want to install Python). It looks straightforward, but I haven’t used Python much.


import camelot


def extract_tables_from_pdf(pdf_path):
    """
    Extract tables from a PDF file using Camelot.
    
    Args:
        pdf_path (str): Path to the PDF file
        
    Returns:
        list: List of pandas DataFrames containing the extracted tables
    """
    # Extract tables using Camelot
    tables = camelot.read_pdf(pdf_path, pages='all')
    
    # Convert Camelot tables to pandas DataFrames
    dataframes = []
    for table in tables:
        df = table.df
        dataframes.append(df)
    
    return dataframes

def main():
    # Example usage
    pdf_path = "c:/pdfs/UserGuide.pdf"  # Replace with your PDF path
    
    try:
        tables = extract_tables_from_pdf(pdf_path)
        print(f"Extracted {len(tables)} tables from the PDF")
        
        # Save each table to a CSV file
        for i, table in enumerate(tables):
            table.to_csv(f"table_{i+1}.csv", index=False)
            print(f"Saved table {i+1} to table_{i+1}.csv")
    except Exception as e:
        print(f"Error extracting tables: {str(e)}")

if __name__ == "__main__":
    main()

AI

AI will probably solve this problem. Here’s a recent article on the subject.

Excel

When those failed and I ran out of time, I found https://nanonets.com/blog/extract-tables-from-pdf/

It turns out Excel is the easiest way: use the From PDF option on the Data tab.

Conclusion

This was a good experiment. I enjoyed generating the code and learning from Windsurf’s Cascade AI suggestions and output. I would definitely use this more in the future.

Sometimes the manual way is the best and quickest way.


